Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? If you answered yes to any of these questions, this book is for you. These problems affect almost every team, yet they're usually addressed ad hoc and reactively.
Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.
• Build more trustworthy and reliable data pipelines
• Write scripts to run data checks and identify broken pipelines with data observability
• Set and maintain data SLAs, SLIs, and SLOs
• Develop and lead data quality initiatives at your company
• Treat data services and systems with the diligence of production software
• Automate data lineage graphs across your data ecosystem
• Build anomaly detectors for your critical data assets
Authors: Barr Moses, Lior Gavish, Molly Vorwerck
Edition: 1
Publisher: O'Reilly Media
Year: 2022
Language: English
Commentary: Publisher's PDF
Pages: 308
City: Sebastopol, CA
Tags: Management; Anomaly Detection; Monitoring; Scalability; Data Cleaning; Data Lake; Data Warehouse; Data Collection; Data Ingestion; Apache Airflow; Data Quality; Data Normalization; Data Reliability
Cover
Copyright
Table of Contents
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Why Data Quality Deserves Attention—Now
What Is Data Quality?
Framing the Current Moment
Understanding the “Rise of Data Downtime”
Other Industry Trends Contributing to the Current Moment
Summary
Chapter 2. Assembling the Building Blocks of a Reliable Data System
Understanding the Difference Between Operational and Analytical Data
What Makes Them Different?
Data Warehouses Versus Data Lakes
Data Warehouses: Table Types at the Schema Level
Data Lakes: Manipulations at the File Level
What About the Data Lakehouse?
Syncing Data Between Warehouses and Lakes
Collecting Data Quality Metrics
What Are Data Quality Metrics?
How to Pull Data Quality Metrics
Using Query Logs to Understand Data Quality in the Warehouse
Using Query Logs to Understand Data Quality in the Lake
Designing a Data Catalog
Building a Data Catalog
Summary
Chapter 3. Collecting, Cleaning, Transforming, and Testing Data
Collecting Data
Application Log Data
API Responses
Sensor Data
Cleaning Data
Batch Versus Stream Processing
Data Quality for Stream Processing
Normalizing Data
Handling Heterogeneous Data Sources
Schema Checking and Type Coercion
Syntactic Versus Semantic Ambiguity in Data
Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka
Running Analytical Data Transformations
Ensuring Data Quality During ETL
Ensuring Data Quality During Transformation
Alerting and Testing
dbt Unit Testing
Great Expectations Unit Testing
Deequ Unit Testing
Managing Data Quality with Apache Airflow
Scheduler SLAs
Installing Circuit Breakers with Apache Airflow
SQL Check Operators
Summary
Chapter 4. Monitoring and Anomaly Detection for Your Data Pipelines
Knowing Your Known Unknowns and Unknown Unknowns
Building an Anomaly Detection Algorithm
Monitoring for Freshness
Understanding Distribution
Building Monitors for Schema and Lineage
Anomaly Detection for Schema Changes and Lineage
Visualizing Lineage
Investigating a Data Anomaly
Scaling Anomaly Detection with Python and Machine Learning
Improving Data Monitoring Alerting with Machine Learning
Accounting for False Positives and False Negatives
Improving Precision and Recall
Detecting Freshness Incidents with Data Monitoring
F-Scores
Does Model Accuracy Matter?
Beyond the Surface: Other Useful Anomaly Detection Approaches
Designing Data Quality Monitors for Warehouses Versus Lakes
Summary
Chapter 5. Architecting for Data Reliability
Measuring and Maintaining High Data Reliability at Ingestion
Measuring and Maintaining Data Quality in the Pipeline
Understanding Data Quality Downstream
Building Your Data Platform
Data Ingestion
Data Storage and Processing
Data Transformation and Modeling
Business Intelligence and Analytics
Data Discovery and Governance
Developing Trust in Your Data
Data Observability
Measuring the ROI on Data Quality
How to Set SLAs, SLOs, and SLIs for Your Data
Case Study: Blinkist
Summary
Chapter 6. Fixing Data Quality Issues at Scale
Fixing Quality Issues in Software Development
Data Incident Management
Incident Detection
Response
Root Cause Analysis
Resolution
Blameless Postmortem
Incident Response and Mitigation
Establishing a Routine of Incident Management
Why Data Incident Commanders Matter
Case Study: Data Incident Management at PagerDuty
The DataOps Landscape at PagerDuty
Data Challenges at PagerDuty
Using DevOps Best Practices to Scale Data Incident Management
Summary
Chapter 7. Building End-to-End Lineage
Building End-to-End Field-Level Lineage for Modern Data Systems
Basic Lineage Requirements
Data Lineage Design
Parsing the Data
Building the User Interface
Case Study: Architecting for Data Reliability at Fox
Exercise “Controlled Freedom” When Dealing with Stakeholders
Invest in a Decentralized Data Team
Avoid Shiny New Toys in Favor of Problem-Solving Tech
To Make Analytics Self-Serve, Invest in Data Trust
Summary
Chapter 8. Democratizing Data Quality
Treating Your “Data” Like a Product
Perspectives on Treating Data Like a Product
Convoy Case Study: Data as a Service or Output
Uber Case Study: The Rise of the Data Product Manager
Applying the Data-as-a-Product Approach
Building Trust in Your Data Platform
Align Your Product’s Goals with the Goals of the Business
Gain Feedback and Buy-in from the Right Stakeholders
Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains
Sign Off on Baseline Metrics for Your Data and How You Measure Them
Know When to Build Versus Buy
Assigning Ownership for Data Quality
Chief Data Officer
Business Intelligence Analyst
Analytics Engineer
Data Scientist
Data Governance Lead
Data Engineer
Data Product Manager
Who Is Responsible for Data Reliability?
Creating Accountability for Data Quality
Balancing Data Accessibility with Trust
Certifying Your Data
Seven Steps to Implementing a Data Certification Program
Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team
In the Beginning: When a Small Team Struggles to Meet Data Demands
Supporting Hypergrowth as a Decentralized Data Operation
Regrouping, Recentralizing, and Refocusing on Data Trust
Considerations When Scaling Your Data Team
Increasing Data Literacy
Prioritizing Data Governance and Compliance
Prioritizing a Data Catalog
Beyond Catalogs: Enforcing Data Governance
Building a Data Quality Strategy
Make Leadership Accountable for Data Quality
Set Data Quality KPIs
Spearhead a Data Governance Program
Automate Your Lineage and Data Governance Tooling
Create a Communications Plan
Summary
Chapter 9. Data Quality in the Real World: Conversations and Case Studies
Building a Data Mesh for Greater Data Quality
Domain-Oriented Data Owners and Pipelines
Self-Serve Functionality
Interoperability and Standardization of Communications
Why Implement a Data Mesh?
To Mesh or Not to Mesh? That Is the Question
Calculating Your Data Mesh Score
A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh
Can You Build a Data Mesh from a Single Solution?
Is Data Mesh Another Word for Data Virtualization?
Does Each Data Product Team Manage Their Own Separate Data Stores?
Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?
Is the Data Mesh Right for All Data Teams?
Does One Person on Your Team “Own” the Data Mesh?
Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?
Case Study: Kolibri Games’ Data Stack Journey
First Data Needs
Pursuing Performance Marketing
2018: Professionalize and Centralize
Getting Data-Oriented
Getting Data-Driven
Building a Data Mesh
Five Key Takeaways from a Five-Year Data Evolution
Making Metadata Work for the Business
Unlocking the Value of Metadata with Data Discovery
Data Warehouse and Lake Considerations
Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh
Moving from Traditional Data Catalogs to Modern Data Discovery
Deciding When to Get Started with Data Quality at Your Company
You’ve Recently Migrated to the Cloud
Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity
Your Data Team Is Growing
Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues
Your Team Has More Data Consumers Than They Did One Year Ago
Your Company Is Moving to a Self-Service Analytics Model
Data Is a Key Part of the Customer Value Proposition
Data Quality Starts with Trust
Summary
Chapter 10. Pioneering the Future of Reliable Data Systems
Be Proactive, Not Reactive
Predictions for the Future of Data Quality and Reliability
Data Warehouses and Lakes Will Merge
Emergence of New Roles on the Data Team
Rise of Automation
More Distributed Environments and the Rise of Data Domains
So Where Do We Go from Here?
Index
About the Authors
Colophon