Automating Data Quality Monitoring at Scale: Going Deeper than Data Observability (Third Early Release)

The world's businesses ingest a combined 2.5 quintillion bytes of data every day. But how much of this vast amount of data (used to build products, power AI systems, and drive business decisions) is poor quality or just plain bad? This practical book shows you how to ensure that the data your organization relies on contains only high-quality records. Most data engineers, data analysts, and data scientists genuinely care about data quality, but they often don't have the time, resources, or understanding to create a data quality monitoring solution that succeeds at scale. In this book, Jeremy Stanley and Paige Schwartz from Anomalo explain how you can use automated data quality monitoring to cover all your tables efficiently, proactively alert on every category of issue, and resolve problems immediately.

We’ll wrap up by introducing the data quality monitoring strategy we advocate for in this book: a three-pillar approach combining rules, metrics monitoring, and unsupervised machine learning. As we’ll show, this approach has multiple benefits. It allows subject matter experts to enforce essential constraints and track KPIs for important tables, while providing a base level of automated monitoring for a large volume of diverse data. This approach doesn’t require massive computing power or legions of analysts to maintain rules and thresholds. With machine learning, it will detect “unknown unknowns” in the data and reduce alert fatigue by understanding correlations and trends in the data values across columns and even across tables, alerting only when changes are new and significant.

This book will help you:
Learn why data quality is a business imperative
Understand and assess unsupervised learning models for detecting data issues
Implement notifications that reduce alert fatigue and let you triage and resolve issues quickly
Integrate automated data quality monitoring with data catalogs, orchestration layers, and BI and ML systems
Understand the limits of automated data quality monitoring and how to overcome them
Learn how to deploy and manage your monitoring solution at scale
Maintain automated data quality monitoring for the long term

Author(s): Jeremy Stanley and Paige Schwartz
Publisher: O'Reilly Media, Inc.
Year: 2023

Language: English
Commentary: early release, raw and unedited