Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

Share
Add to Wishlist
PDF

Computers\\Programming: Programming Languages

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Download

Search on Amazon

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.

DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.

Full of real-world techniques, the examples in the book contain working code. You'll learn how to:

Identity and remove duplicates in two different datasets using SQL

Regularize data and achieve data quality using SQL

Extract data from XML and JSON

Generate SQL...

Author(s): Jim Lehmer
Publisher: O'Reilly Media
Year: 2023

Language: English
Pages: 282

Preface
What Problems Are We Trying to Solve?
What Will We Cover?
Part I: Review
Part II: Various Data Problems
Part III: Bringing It Together
Appendix
Who Is This Book For?
Why SQL?
Warning! Opinions Ahead!
Typographical Conventions Used in This Book
Additional Information on the Book’s Conventions
The Data “Model”
Environment Layout
Customer Table
“Normalized” View
Meet the Snedleys
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
I. Review
1. A SELECT Review
Simple SELECT Statements
Common Table Expressions
In CASE of Emergency
Joins
A Diversion into NULL Values
OUTER JOINs
Finding the Most Current Value
Final Thoughts on SELECT
2. Function Junction
Aggregate Functions
MAX
MIN
COUNT
SUM
AVG
Conversion Functions
CAST and CONVERT
COALESCE
TRY_CONVERT
Cryptographic Functions: HASHBYTES
Date and Time Functions
GETDATE
DATEADD
DATEDIFF
DATEPART
ISDATE
Logical Functions: IIF
String Functions
CHARINDEX and PATINDEX
LEN
LEFT, RIGHT, and SUBSTRING
LTRIM, RTRIM, and TRIM
LOWER and UPPER
REPLACE and TRANSLATE
REVERSE
STRING_AGG
System Functions
ISNULL
ISNUMERIC
Final Thoughts on Functions
II. Various Data Problems
3. Names, Names, Names
What’s in a Name?
Last Names
Punctuation
Suffixes
First Names
Middle Name
Nicknames
Company Name
Full Name
“Person-Like Entities”
Final Thoughts on Names
4. Location, Location, Location
What Makes an Address?
Street Address
Box, Suite, Lot, or Apartment Number
Don’t Overdo It!
City
County
State or State Abbreviation
ZIP or Postal Code
Country
Final Thoughts on Locations
5. Dates, Dates, Dates
Time Is Relative
Final Thoughts on Dates
6. Email
What Makes a Valid Email Address?
Final Thoughts on Email
7. Phone Numbers
What Makes a “Phone Number”?
One Final Note on Tax IDs
Final Thoughts on Phone Numbers (and Tax IDs)
8. Bad Characters
Data Representations
Invisible Whitespace
COLLATE
Cleaning Up the Input Data
Final Thoughts on Bad Characters
9. Orthogonal Data
A Common Problem, A Common Solution, A New Common Problem
Lather, Rinse, Repeat
Final Thoughts on Orthogonal Data
III. Bringing It Together
10. The Big Score
What Will We Want?
Tuning Scores
Eliminating Duplicates
Duplicate Data
Duplicated Data
Final Thoughts on Scoring
11. Data Quality, or GIGO
Sneaking Data Quality In
Impossible Data
Simply Wrong
Semantically Wrong
ETL Your Way to Success
Final Thoughts on Data Quality
12. Tying It All Together
Approach
What’s the Score?
First Pass: Naive Matching
Second Pass: Normalizing Relations
Impossible Data
Now Let’s Normalize
Third Pass: Score!
What About Tuning?
Final Thoughts on Practical Matters
13. Code Is Data, Too!
Working with XML Data
Working with JSON Data
Extracting Data from HTML
Code-Generating Code
Impact Analysis: The Second Case Study
Gather Together Every Code “Artifact” You Can
Import Artifacts into SQL
And Now, for My Next Trick
Final Thoughts on Code As Data
Final Thoughts on All of It
A. The Data “Model”
Customer Table
NormalizedCustomer View
PotentialMatches Table
CustomerCountByState View
PostalAbbreviations Table
Glossary
Index

Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

My Account

Infomation

Talk To Us

Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

How to Download?

Is it Free?

Book is not loading. What to do?