Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process

Deploy your data ingestion pipeline, orchestrate it, and monitor it efficiently to prevent data loss and quality issues

Key Features
• Harness best practices to create a Python and PySpark data ingestion pipeline
• Seamlessly automate and orchestrate your data pipelines using Apache Airflow
• Build a monitoring framework by integrating the concept of data observability into your pipelines

Book Description
Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges. You'll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you'll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation. By the end of the book, you'll have a fully automated setup that enables you to start ingesting and monitoring your data pipeline effortlessly, facilitating seamless integration with subsequent stages of the ETL process.

What you will learn
• Implement data observability using monitoring tools
• Automate your data ingestion pipeline
• Read analytical and partitioned data, whether schema or non-schema based
• Debug and prevent data loss through efficient data monitoring and logging
• Establish data access policies using a data governance framework
• Construct a data orchestration framework to improve data quality

Who this book is for
This book is for data engineers and data enthusiasts seeking a comprehensive understanding of the data ingestion process using popular tools in the open source community. For more advanced learners, this book takes on the theoretical pillars of data governance while providing practical examples of real-world scenarios commonly encountered by data engineers.
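
To give a flavor of the recipes covered in Chapters 4 and 6, the sketch below shows one common ingestion step the book discusses: reading a CSV file into PySpark with an explicitly defined schema instead of schema inference. It is a minimal illustration, not code from the book; the file path and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Start a local SparkSession; the application name is arbitrary
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("ingestion-sketch") \
        .getOrCreate()

    # Hypothetical schema for a sample CSV file; column names are placeholders
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
        StructField("signup_date", StringType(), nullable=True),
    ])

    # Read with an explicit schema rather than relying on inference,
    # which keeps column types predictable across ingestion runs
    df = spark.read.csv("data/sample.csv", header=True, schema=schema)
    df.show(5)

    spark.stop()

Declaring the schema up front is one of the practices the book emphasizes: malformed or unexpected input fails fast instead of silently changing column types between ingestion runs.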

Author(s): Gláucia Esppenchutz
Edition: 1
Publisher: Packt Publishing
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 414
City: Birmingham, UK
Tags: Google Cloud Platform; Amazon Web Services; Analytics; Debugging; Python; NoSQL; MongoDB; Cookbook; Monitoring; Logging; Docker; JSON; Downtime; CSV; PySpark; SSH; Error Handling; Automation; AWS Simple Storage Service; Data Pipelines; Directed Acyclic Graphs; Data Ingestion; Apache Airflow; OpenMetadata; Data Observability

Cover
Title Page
Copyright and Credits
Dedications
Contributors
Table of Contents
Preface
Part 1: Fundamentals of Data Ingestion
Chapter 1: Introduction to Data Ingestion
Technical requirements
Setting up Python and its environment
Getting ready
How to do it…
How it works…
There’s more…
See also
Installing PySpark
Getting ready
How to do it…
How it works…
There’s more…
See also
Configuring Docker for MongoDB
Getting ready
How to do it…
How it works…
There’s more…
See also
Configuring Docker for Airflow
Getting ready
How to do it…
How it works…
See also
Creating schemas
Getting ready
How to do it…
How it works…
See also
Applying data governance in ingestion
Getting ready
How to do it…
How it works…
See also
Implementing data replication
Getting ready
How to do it…
How it works…
There’s more…
Further reading
Chapter 2: Principles of Data Access – Accessing Your Data
Technical requirements
Implementing governance in a data access workflow
Getting ready
How to do it…
How it works…
See also
Accessing databases and data warehouses
Getting ready
How to do it…
How it works…
There’s more…
See also
Accessing SSH File Transfer Protocol (SFTP) files
Getting ready
How to do it…
How it works…
There’s more…
See also
Retrieving data using API authentication
Getting ready
How to do it…
How it works…
There’s more…
See also
Managing encrypted files
Getting ready
How to do it…
How it works…
There’s more…
See also
Accessing data from AWS using S3
Getting ready
How to do it…
How it works…
There’s more…
See also
Accessing data from GCP using Cloud Storage
Getting ready
How to do it…
How it works…
There’s more…
Further reading
Chapter 3: Data Discovery – Understanding Our Data before Ingesting It
Technical requirements
Documenting the data discovery process
Getting ready
How to do it…
How it works…
Configuring OpenMetadata
Getting ready
How to do it…
How it works…
There’s more…
See also
Connecting OpenMetadata to our database
Getting ready
How to do it…
How it works…
Further reading
Other tools
Chapter 4: Reading CSV and JSON Files and Solving Problems
Technical requirements
Reading a CSV file
Getting ready
How to do it…
How it works…
There’s more…
See also
Reading a JSON file
Getting ready
How to do it…
How it works…
There’s more…
See also
Creating a SparkSession for PySpark
Getting ready
How to do it…
How it works…
There’s more…
See also
Using PySpark to read CSV files
Getting ready
How to do it…
How it works…
There’s more…
See also
Using PySpark to read JSON files
Getting ready
How to do it…
How it works…
There’s more…
See also
Further reading
Chapter 5: Ingesting Data from Structured and Unstructured Databases
Technical requirements
Configuring a JDBC connection
Getting ready
How to do it…
How it works…
There’s more…
See also
Ingesting data from a JDBC database using SQL
Getting ready
How to do it…
How it works…
There’s more…
See also
Connecting to a NoSQL database (MongoDB)
Getting ready
How to do it…
How it works…
There’s more…
See also
Creating our NoSQL table in MongoDB
Getting ready
How to do it…
How it works…
There’s more…
See also
Ingesting data from MongoDB using PySpark
Getting ready
How to do it…
How it works…
There’s more…
See also
Further reading
Chapter 6: Using PySpark with Defined and Non-Defined Schemas
Technical requirements
Applying schemas to data ingestion
Getting ready
How to do it…
How it works…
There’s more…
See also
Importing structured data using a well-defined schema
Getting ready
How to do it…
How it works…
There’s more…
See also
Importing unstructured data without a schema
Getting ready
How to do it…
How it works…
Ingesting unstructured data with a well-defined schema and format
Getting ready
How to do it…
How it works…
There’s more…
See also
Inserting formatted SparkSession logs to facilitate your work
Getting ready
How to do it…
How it works…
There’s more…
See also
Further reading
Chapter 7: Ingesting Analytical Data
Technical requirements
Ingesting Parquet files
Getting ready
How to do it…
How it works…
There’s more…
See also
Ingesting Avro files
Getting ready
How to do it…
How it works…
There’s more…
See also
Applying schemas to analytical data
Getting ready
How to do it…
How it works…
There’s more…
See also
Filtering data and handling common issues
Getting ready
How to do it…
How it works…
There’s more…
See also
Ingesting partitioned data
Getting ready
How to do it…
How it works…
There’s more…
See also
Applying reverse ETL
Getting ready
How to do it…
How it works…
There’s more…
See also
Selecting analytical data for reverse ETL
Getting ready
How to do it…
How it works…
See also
Further reading
Part 2: Structuring the Ingestion Pipeline
Chapter 8: Designing Monitored Data Workflows
Technical requirements
Inserting logs
Getting ready
How to do it…
How it works…
See also
Using log-level types
Getting ready
How to do it…
How it works…
There’s more…
See also
Creating standardized logs
Getting ready
How to do it…
How it works…
There’s more…
See also
Monitoring our data ingest file size
Getting ready
How to do it…
How it works…
There’s more…
See also
Logging based on data
Getting ready
How to do it…
How it works…
There’s more…
See also
Retrieving SparkSession metrics
Getting ready
How to do it…
How it works…
There’s more…
See also
Further reading
Chapter 9: Putting Everything Together with Airflow
Technical requirements
Installing Airflow
Configuring Airflow
Getting ready
How to do it…
How it works…
See also
Creating DAGs
Getting ready
How to do it…
How it works…
There's more…
See also
Creating custom operators
Getting ready
How to do it…
How it works…
There's more…
See also
Configuring sensors
Getting ready
How to do it…
How it works…
See also
Creating connectors in Airflow
Getting ready
How to do it…
How it works…
There's more…
See also
Creating parallel ingest tasks
Getting ready
How to do it…
How it works…
There's more…
See also
Defining ingest-dependent DAGs
Getting ready
How to do it…
How it works…
There's more…
See also
Further reading
Chapter 10: Logging and Monitoring Your Data Ingest in Airflow
Technical requirements
Installing and running Airflow
Creating basic logs in Airflow
Getting ready
How to do it…
How it works…
See also
Storing log files in a remote location
Getting ready
How to do it…
How it works…
See also
Configuring logs in airflow.cfg
Getting ready
How to do it…
How it works…
There’s more…
See also
Designing advanced monitoring
Getting ready
How to do it…
How it works…
There’s more…
See also
Using notification operators
Getting ready
How to do it…
How it works…
There’s more…
Using SQL operators for data quality
Getting ready
How to do it…
How it works…
There’s more…
See also
Further reading
Chapter 11: Automating Your Data Ingestion Pipelines
Technical requirements
Installing and running Airflow
Scheduling daily ingestions
Getting ready
How to do it…
How it works…
There's more…
See also
Scheduling historical data ingestion
Getting ready
How to do it…
How it works…
There's more…
Scheduling data replication
Getting ready
How to do it…
How it works…
There's more…
Setting up the schedule_interval parameter
Getting ready
How to do it…
How it works…
See also
Solving scheduling errors
Getting ready
How to do it…
How it works…
There’s more…
Further reading
Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime
Technical requirements
Docker images
Setting up StatsD for monitoring
Getting ready
How to do it…
How it works…
See also
Setting up Prometheus for storing metrics
Getting ready
How to do it…
How it works…
There’s more…
Setting up Grafana for monitoring
Getting ready
How to do it…
How it works…
There’s more…
Creating an observability dashboard
Getting ready
How to do it…
How it works…
There’s more…
Setting custom alerts or notifications
Getting ready
How to do it…
How it works…
Further reading
Index
Other Books You May Enjoy