Fuzzy Data Matching with SQL: Enhancing Data Quality and Query Performance

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

If you were handed two different but related sets of data, what tools would you use to find the matches? What if all you had was SQL SELECT access to a database? In this practical book, author Jim Lehmer provides best practices, techniques, and tricks to help you import, clean, match, score, and think about heterogeneous data using SQL.

DBAs, programmers, business analysts, and data scientists will learn how to identify and remove duplicates, parse strings, extract data from XML and JSON, generate SQL using SQL, regularize data and prepare datasets, and apply data quality and ETL approaches for finding the similarities and differences between various expressions of the same data.

Full of real-world techniques, the examples in the book contain working code. You'll learn how to:

  • Identity and remove duplicates in two different datasets using SQL
  • Regularize data and achieve data quality using SQL
  • Extract data from XML and JSON
  • Generate SQL...
  • Author(s): Jim Lehmer
    Publisher: O'Reilly Media
    Year: 2023

    Language: English
    Pages: 282

    Preface
    What Problems Are We Trying to Solve?
    What Will We Cover?
    Part I: Review
    Part II: Various Data Problems
    Part III: Bringing It Together
    Appendix
    Who Is This Book For?
    Why SQL?
    Warning! Opinions Ahead!
    Typographical Conventions Used in This Book
    Additional Information on the Book’s Conventions
    The Data “Model”
    Environment Layout
    Customer Table
    “Normalized” View
    Meet the Snedleys
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    I. Review
    1. A SELECT Review
    Simple SELECT Statements
    Common Table Expressions
    In CASE of Emergency
    Joins
    A Diversion into NULL Values
    OUTER JOINs
    Finding the Most Current Value
    Final Thoughts on SELECT
    2. Function Junction
    Aggregate Functions
    MAX
    MIN
    COUNT
    SUM
    AVG
    Conversion Functions
    CAST and CONVERT
    COALESCE
    TRY_CONVERT
    Cryptographic Functions: HASHBYTES
    Date and Time Functions
    GETDATE
    DATEADD
    DATEDIFF
    DATEPART
    ISDATE
    Logical Functions: IIF
    String Functions
    CHARINDEX and PATINDEX
    LEN
    LEFT, RIGHT, and SUBSTRING
    LTRIM, RTRIM, and TRIM
    LOWER and UPPER
    REPLACE and TRANSLATE
    REVERSE
    STRING_AGG
    System Functions
    ISNULL
    ISNUMERIC
    Final Thoughts on Functions
    II. Various Data Problems
    3. Names, Names, Names
    What’s in a Name?
    Last Names
    Punctuation
    Suffixes
    First Names
    Middle Name
    Nicknames
    Company Name
    Full Name
    “Person-Like Entities”
    Final Thoughts on Names
    4. Location, Location, Location
    What Makes an Address?
    Street Address
    Box, Suite, Lot, or Apartment Number
    Don’t Overdo It!
    City
    County
    State or State Abbreviation
    ZIP or Postal Code
    Country
    Final Thoughts on Locations
    5. Dates, Dates, Dates
    Time Is Relative
    Final Thoughts on Dates
    6. Email
    What Makes a Valid Email Address?
    Final Thoughts on Email
    7. Phone Numbers
    What Makes a “Phone Number”?
    One Final Note on Tax IDs
    Final Thoughts on Phone Numbers (and Tax IDs)
    8. Bad Characters
    Data Representations
    Invisible Whitespace
    COLLATE
    Cleaning Up the Input Data
    Final Thoughts on Bad Characters
    9. Orthogonal Data
    A Common Problem, A Common Solution, A New Common Problem
    Lather, Rinse, Repeat
    Final Thoughts on Orthogonal Data
    III. Bringing It Together
    10. The Big Score
    What Will We Want?
    Tuning Scores
    Eliminating Duplicates
    Duplicate Data
    Duplicated Data
    Final Thoughts on Scoring
    11. Data Quality, or GIGO
    Sneaking Data Quality In
    Impossible Data
    Simply Wrong
    Semantically Wrong
    ETL Your Way to Success
    Final Thoughts on Data Quality
    12. Tying It All Together
    Approach
    What’s the Score?
    First Pass: Naive Matching
    Second Pass: Normalizing Relations
    Impossible Data
    Now Let’s Normalize
    Third Pass: Score!
    What About Tuning?
    Final Thoughts on Practical Matters
    13. Code Is Data, Too!
    Working with XML Data
    Working with JSON Data
    Extracting Data from HTML
    Code-Generating Code
    Impact Analysis: The Second Case Study
    Gather Together Every Code “Artifact” You Can
    Import Artifacts into SQL
    And Now, for My Next Trick
    Final Thoughts on Code As Data
    Final Thoughts on All of It
    A. The Data “Model”
    Customer Table
    NormalizedCustomer View
    PotentialMatches Table
    CustomerCountByState View
    PostalAbbreviations Table
    Glossary
    Index