Automating Data Quality Monitoring: Going Deeper Than Data Observability

The world's businesses ingest a combined 2.5 quintillion bytes of data every day. But how much of this vast amount of data (used to build products, power AI systems, and drive business decisions) is poor quality or just plain bad? This practical book shows you how to ensure that the data your organization relies on contains only high-quality records.

Most data engineers, data analysts, and data scientists genuinely care about data quality, but they often don't have the time, resources, or understanding to create a data quality monitoring solution that succeeds at scale. In this book, Jeremy Stanley and Paige Schwartz from Anomalo explain how you can use automated data quality monitoring to cover all your tables efficiently, proactively alert on every category of issue, and resolve problems immediately.

This book will help you:
• Learn why data quality is a business imperative
• Understand and assess unsupervised learning models for detecting data issues
• Implement notifications that reduce alert fatigue and let you triage and resolve issues quickly
• Integrate automated data quality monitoring with data catalogs, orchestration layers, and BI and ML systems
• Understand the limits of automated data quality monitoring and how to overcome them
• Learn how to deploy and manage your monitoring solution at scale
• Maintain automated data quality monitoring for the long term

Author(s): Jeremy Stanley, Paige Schwartz
Edition: 1
Publisher: O'Reilly Media
Year: 2024

Language: English
Commentary: Publisher's PDF
Pages: 217
City: Sebastopol, CA
Tags: Machine Learning; Unsupervised Learning; Monitoring; Scalability; Automation; Alerting; Data Quality; Data Observability

Cover
Copyright
Table of Contents
Foreword
Preface
Who Should Use This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. The Data Quality Imperative
High-Quality Data Is the New Gold
Data-Driven Companies Are Today’s Disrupters
Data Analytics Is Democratized
AI and Machine Learning Are Differentiators
Companies Are Investing in a Modern Data Stack
More Data, More Problems
Issues Inside the Data Factory
Data Migrations
Third-Party Data Sources
Company Growth and Change
Exogenous Factors
Why We Need Data Quality Monitoring
Data Scars
Data Shocks
Automating Data Quality Monitoring: The New Frontier
Chapter 2. Data Quality Monitoring Strategies and the Role of Automation
Monitoring Requirements
Data Observability: Necessary, but Not Sufficient
Traditional Approaches to Data Quality
Manual Data Quality Detection
Rule-Based Testing
Metrics Monitoring
Automating Data Quality Monitoring with Unsupervised Machine Learning
What Is Unsupervised Machine Learning?
An Analogy: Lane Departure Warnings
The Limits of Automation
A Four-Pillar Approach to Data Quality Monitoring
Chapter 3. Assessing the Business Impact of Automated Data Quality Monitoring
Assessing Your Data
Volume
Variety
Velocity
Veracity
Special Cases
Assessing Your Industry
Regulatory Pressure
AI/ML Risks
Data as a Product
Assessing Your Data Maturity
Assessing Benefits to Stakeholders
Engineers
Data Leadership
Scientists
Consumers
Conducting an ROI Analysis
Quantitative Measures
Qualitative Measures
Conclusion
Chapter 4. Automating Data Quality Monitoring with Machine Learning
Requirements
Sensitivity
Specificity
Transparency
Scalability
Nonrequirements
Data Quality Monitoring Is Not Outlier Detection
ML Approach and Algorithm
Data Sampling
Feature Encoding
Model Development
Model Explainability
Putting It Together with Pseudocode
Other Applications
Conclusion
Chapter 5. Building a Model That Works on Real-World Data
Data Challenges and Mitigations
Seasonality
Time-Based Features
Chaotic Tables
Updated-in-Place Tables
Column Correlations
Model Testing
Injecting Synthetic Issues
Benchmarking
Improving the Model
Conclusion
Chapter 6. Implementing Notifications While Avoiding Alert Fatigue
How Notifications Facilitate Data Issue Response
Triage
Routing
Resolution
Documentation
Taking Action Without Notifications
Anatomy of a Notification
Visualization
Actions
Text Description
Who Created/Last Edited the Check
Delivering Notifications
Notification Audience
Notification Channels
Notification Timing
Avoiding Alert Fatigue
Scheduling Checks in the Right Order
Clustering Alerts Using Machine Learning
Suppressing Notifications
Automating the Root Cause Analysis
Conclusion
Chapter 7. Integrating Monitoring with Data Tools and Systems
Monitoring Your Data Stack
Data Warehouses
Integrating with Data Warehouses
Security
Reconciling Data Across Multiple Warehouses
Data Orchestrators
Integrating with Orchestrators
Data Catalogs
Integrating with Catalogs
Data Consumers
Analytics and BI Tools
MLOps
Conclusion
Chapter 8. Operating Your Solution at Scale
Build Versus Buy
Vendor Deployment Models
Configuration
Determining Which Tables Are Most Important
Deciding What Data in a Table to Monitor
Configuration at Scale
Enablement
User Roles and Permissions
Onboarding, Training, and Support
Improving Data Quality Over Time
Initiatives
Metrics
From Chaos to Clarity
Appendix A. Types of Data Quality Issues
Table Issues
Late Arrival
Schema Changes
Untraceable Changes
Row Issues
Incomplete Rows
Duplicate Rows
Temporal Inconsistency
Value Issues
Missing Values
Incorrect Values
Invalid Values
Multi Issues
Relational Failures
Inconsistent Sources
Index
About the Authors
Colophon