If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.
DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.
Full of real-world techniques, the examples in the book contain working code. You'll learn how to:
Identity and remove duplicates in two different datasets using SQLRegularize data and achieve data quality using SQLExtract data from XML and JSONGenerate SQL...
Author(s): Jim Lehmer
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 282
Preface
What Problems Are We Trying to Solve?
What Will We Cover?
Part I: Review
Part II: Various Data Problems
Part III: Bringing It Together
Appendix
Who Is This Book For?
Why SQL?
Warning! Opinions Ahead!
Typographical Conventions Used in This Book
Additional Information on the Book’s Conventions
The Data “Model”
Environment Layout
Customer Table
“Normalized” View
Meet the Snedleys
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
I. Review
1. A SELECT Review
Simple SELECT Statements
Common Table Expressions
In CASE of Emergency
Joins
A Diversion into NULL Values
OUTER JOINs
Finding the Most Current Value
Final Thoughts on SELECT
2. Function Junction
Aggregate Functions
MAX
MIN
COUNT
SUM
AVG
Conversion Functions
CAST and CONVERT
COALESCE
TRY_CONVERT
Cryptographic Functions: HASHBYTES
Date and Time Functions
GETDATE
DATEADD
DATEDIFF
DATEPART
ISDATE
Logical Functions: IIF
String Functions
CHARINDEX and PATINDEX
LEN
LEFT, RIGHT, and SUBSTRING
LTRIM, RTRIM, and TRIM
LOWER and UPPER
REPLACE and TRANSLATE
REVERSE
STRING_AGG
System Functions
ISNULL
ISNUMERIC
Final Thoughts on Functions
II. Various Data Problems
3. Names, Names, Names
What’s in a Name?
Last Names
Punctuation
Suffixes
First Names
Middle Name
Nicknames
Company Name
Full Name
“Person-Like Entities”
Final Thoughts on Names
4. Location, Location, Location
What Makes an Address?
Street Address
Box, Suite, Lot, or Apartment Number
Don’t Overdo It!
City
County
State or State Abbreviation
ZIP or Postal Code
Country
Final Thoughts on Locations
5. Dates, Dates, Dates
Time Is Relative
Final Thoughts on Dates
6. Email
What Makes a Valid Email Address?
Final Thoughts on Email
7. Phone Numbers
What Makes a “Phone Number”?
One Final Note on Tax IDs
Final Thoughts on Phone Numbers (and Tax IDs)
8. Bad Characters
Data Representations
Invisible Whitespace
COLLATE
Cleaning Up the Input Data
Final Thoughts on Bad Characters
9. Orthogonal Data
A Common Problem, A Common Solution, A New Common Problem
Lather, Rinse, Repeat
Final Thoughts on Orthogonal Data
III. Bringing It Together
10. The Big Score
What Will We Want?
Tuning Scores
Eliminating Duplicates
Duplicate Data
Duplicated Data
Final Thoughts on Scoring
11. Data Quality, or GIGO
Sneaking Data Quality In
Impossible Data
Simply Wrong
Semantically Wrong
ETL Your Way to Success
Final Thoughts on Data Quality
12. Tying It All Together
Approach
What’s the Score?
First Pass: Naive Matching
Second Pass: Normalizing Relations
Impossible Data
Now Let’s Normalize
Third Pass: Score!
What About Tuning?
Final Thoughts on Practical Matters
13. Code Is Data, Too!
Working with XML Data
Working with JSON Data
Extracting Data from HTML
Code-Generating Code
Impact Analysis: The Second Case Study
Gather Together Every Code “Artifact” You Can
Import Artifacts into SQL
And Now, for My Next Trick
Final Thoughts on Code As Data
Final Thoughts on All of It
A. The Data “Model”
Customer Table
NormalizedCustomer View
PotentialMatches Table
CustomerCountByState View
PostalAbbreviations Table
Glossary
Index