Data Pipelines Pocket Reference: Moving and Processing Data for Analytics

Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack. You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:

• What a data pipeline is and how it works
• How data is moved and processed on modern data infrastructure, including cloud platforms
• Common tools and products used by data engineers to build pipelines
• How pipelines support analytics and reporting needs
• Considerations for pipeline maintenance, testing, and alerting

Author(s): James Densmore
Edition: 1
Publisher: O'Reilly Media
Year: 2021

Language: English
Commentary: Vector PDF
Pages: 276
City: Sebastopol, CA
Tags: Data Analysis; Python; Java; SQL; Distributed Systems; Monitoring; System Administration; Logging; Pipelines; Best Practices; Data Warehouse; Directed Acyclic Graphs; Data Integration; Pipeline Orchestration; Data Ingestion; Data Validation; Data Infrastructure

Copyright
Table of Contents
Preface
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Introduction to Data Pipelines
What Are Data Pipelines?
Who Builds Data Pipelines?
SQL and Data Warehousing Fundamentals
Python and/or Java
Distributed Computing
Basic System Administration
A Goal-Oriented Mentality
Why Build Data Pipelines?
How Are Pipelines Built?
Chapter 2. A Modern Data Infrastructure
Diversity of Data Sources
Source System Ownership
Ingestion Interface and Data Structure
Data Volume
Data Cleanliness and Validity
Latency and Bandwidth of the Source System
Cloud Data Warehouses and Data Lakes
Data Ingestion Tools
Data Transformation and Modeling Tools
Workflow Orchestration Platforms
Directed Acyclic Graphs
Customizing Your Data Infrastructure
Chapter 3. Common Data Pipeline Patterns
ETL and ELT
The Emergence of ELT over ETL
EtLT Subpattern
ELT for Data Analysis
ELT for Data Science
ELT for Data Products and Machine Learning
Steps in a Machine Learning Pipeline
Incorporate Feedback in the Pipeline
Further Reading on ML Pipelines
Chapter 4. Data Ingestion: Extracting Data
Setting Up Your Python Environment
Setting Up Cloud File Storage
Extracting Data from a MySQL Database
Full or Incremental MySQL Table Extraction
Binary Log Replication of MySQL Data
Extracting Data from a PostgreSQL Database
Full or Incremental Postgres Table Extraction
Replicating Data Using the Write-Ahead Log
Extracting Data from MongoDB
Extracting Data from a REST API
Streaming Data Ingestion with Kafka and Debezium
Chapter 5. Data Ingestion: Loading Data
Configuring an Amazon Redshift Warehouse as a Destination
Loading Data into a Redshift Warehouse
Incremental Versus Full Loads
Loading Data Extracted from a CDC Log
Configuring a Snowflake Warehouse as a Destination
Loading Data into a Snowflake Data Warehouse
Using Your File Storage as a Data Lake
Open Source Frameworks
Commercial Alternatives
Chapter 6. Transforming Data
Noncontextual Transformations
Deduplicating Records in a Table
Parsing URLs
When to Transform? During or After Ingestion?
Data Modeling Foundations
Key Data Modeling Terms
Modeling Fully Refreshed Data
Slowly Changing Dimensions for Fully Refreshed Data
Modeling Incrementally Ingested Data
Modeling Append-Only Data
Modeling Change Capture Data
Chapter 7. Orchestrating Pipelines
Apache Airflow Setup and Overview
Installing and Configuring
Airflow Database
Web Server and UI
Scheduler
Executors
Operators
Building Airflow DAGs
A Simple DAG
An ELT Pipeline DAG
Additional Pipeline Tasks
Alerts and Notifications
Data Validation Checks
Advanced Orchestration Configurations
Coupled Versus Uncoupled Pipeline Tasks
When to Split Up DAGs
Coordinating Multiple DAGs with Sensors
Managed Airflow Options
Other Orchestration Frameworks
Chapter 8. Data Validation in Pipelines
Validate Early, Validate Often
Source System Data Quality
Data Ingestion Risks
Enabling Data Analyst Validation
A Simple Validation Framework
Validator Framework Code
Structure of a Validation Test
Running a Validation Test
Usage in an Airflow DAG
When to Halt a Pipeline, When to Warn and Continue
Extending the Framework
Validation Test Examples
Duplicate Records After Ingestion
Unexpected Change in Row Count After Ingestion
Metric Value Fluctuations
Commercial and Open Source Data Validation Frameworks
Chapter 9. Best Practices for Maintaining Pipelines
Handling Changes in Source Systems
Introduce Abstraction
Maintain Data Contracts
Limits of Schema-on-Read
Scaling Complexity
Standardizing Data Ingestion
Reuse of Data Model Logic
Ensuring Dependency Integrity
Chapter 10. Measuring and Monitoring Pipeline Performance
Key Pipeline Metrics
Prepping the Data Warehouse
A Data Infrastructure Schema
Logging and Ingesting Performance Data
Ingesting DAG Run History from Airflow
Adding Logging to the Data Validator
Transforming Performance Data
DAG Success Rate
DAG Runtime Change Over Time
Validation Test Volume and Success Rate
Orchestrating a Performance Pipeline
The Performance DAG
Performance Transparency
Index