Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS.

This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights.

You'll learn how to:
• Use modern data management and data engineering techniques
• Understand how ACID transactions bring reliability to data lakes at scale
• Run streaming and batch jobs against your data lake concurrently
• Execute update, delete, and merge commands against your data lake
• Use time travel to roll back and examine previous data versions
• Build a streaming data quality pipeline following the medallion architecture

Author(s): Bennie Haelen, Dan Davis
Edition: 1
Publisher: O'Reilly Media
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 264
City: Sebastopol, CA
Tags: Big Data; SQL; Stream Processing; PySpark; Performance Tuning; Data Lake; Delta Lake; Data Architecture; Spark

Copyright
Table of Contents
Preface
How to Contact Us
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
Acknowledgment
Chapter 1. The Evolution of Data Architectures
A Brief History of Relational Databases
Data Warehouses
Data Warehouse Architecture
Dimensional Modeling
Data Warehouse Benefits and Challenges
Introducing Data Lakes
Data Lakehouse
Data Lakehouse Benefits
Implementing a Lakehouse
Delta Lake
The Medallion Architecture
The Delta Ecosystem
Delta Lake Storage
Delta Sharing
Delta Connectors
Conclusion
Chapter 2. Getting Started with Delta Lake
Getting a Standard Spark Image
Using Delta Lake with PySpark
Running Delta Lake in the Spark Scala Shell
Running Delta Lake on Databricks
Creating and Running a Spark Program: helloDeltaLake
The Delta Lake Format
Parquet Files
Writing a Delta Table
The Delta Lake Transaction Log
How the Transaction Log Implements Atomicity
Breaking Down Transactions into Atomic Commits
The Transaction Log at the File Level
Scaling Massive Metadata
Conclusion
Chapter 3. Basic Operations on Delta Tables
Creating a Delta Table
Creating a Delta Table with SQL DDL
The DESCRIBE Statement
Creating Delta Tables with the DataFrameWriter API
Creating a Delta Table with the DeltaTableBuilder API
Generated Columns
Reading a Delta Table
Reading a Delta Table with SQL
Reading a Table with PySpark
Writing to a Delta Table
Cleaning Out the YellowTaxis Table
Inserting Data with SQL INSERT
Appending a DataFrame to a Table
Using the OverWrite Mode When Writing to a Delta Table
Inserting Data with the SQL COPY INTO Command
Partitions
User-Defined Metadata
Using SparkSession to Set Custom Metadata
Using the DataFrameWriter to Set Custom Metadata
Conclusion
Chapter 4. Table Deletes, Updates, and Merges
Deleting Data from a Delta Table
Table Creation and DESCRIBE HISTORY
Performing the DELETE Operation
DELETE Performance Tuning Tips
Updating Data in a Table
Use Case Description
Updating Data in a Table
UPDATE Performance Tuning Tips
Upsert Data Using the MERGE Operation
Use Case Description
The MERGE Dataset
The MERGE Statement
Analyzing the MERGE operation with DESCRIBE HISTORY
Inner Workings of the MERGE Operation
Conclusion
Chapter 5. Performance Tuning
Data Skipping
Partitioning
Partitioning Warnings and Considerations
Compact Files
Compaction
OPTIMIZE
ZORDER BY
ZORDER BY Considerations
Liquid Clustering
Enabling Liquid Clustering
Operations on Clustered Columns
Liquid Clustering Warnings and Considerations
Conclusion
Chapter 6. Using Time Travel
Delta Lake Time Travel
Restoring a Table
Restoring via Timestamp
Time Travel Under the Hood
RESTORE Considerations and Warnings
Querying an Older Version of a Table
Data Retention
Data File Retention
Log File Retention
Setting File Retention Duration Example
Data Archiving
VACUUM
VACUUM Syntax and Examples
How Often Should You Run VACUUM and Other Maintenance Tasks?
VACUUM Warnings and Considerations
Changing Data Feed
Enabling the CDF
Viewing the CDF
CDF Warnings and Considerations
Conclusion
Chapter 7. Schema Handling
Schema Validation
Viewing the Schema in the Transaction Log Entries
Schema on Write
Schema Enforcement Example
Schema Evolution
Adding a Column
Missing Data Column in Source DataFrame
Changing a Column Data Type
Adding a NullType Column
Explicit Schema Updates
Adding a Column to a Table
Adding Comments to a Column
Changing Column Ordering
Delta Lake Column Mapping
Renaming a Column
Replacing the Table Columns
Dropping a Column
The REORG TABLE Command
Changing Column Data Type or Name
Conclusion
Chapter 8. Operations on Streaming Data
Streaming Overview
Spark Structured Streaming
Delta Lake and Structured Streaming
Streaming Examples
Hello Streaming World
AvailableNow Streaming
Updating the Source Records
Reading a Stream from the Change Data Feed
Conclusion
Chapter 9. Delta Sharing
Conventional Methods of Data Sharing
Legacy and Homegrown Solutions
Proprietary Vendor Solutions
Cloud Object Storage
Open Source Delta Sharing
Delta Sharing Goals
Delta Sharing Under the Hood
Data Providers and Recipients
Benefits of the Design
The delta-sharing Repository
Step 1: Installing the Python Connector
Step 2: Installing the Profile File
Step 3: Reading a Shared Table
Conclusion
Chapter 10. Building a Lakehouse on Delta Lake
Storage Layer
What Is a Data Lake?
Types of Data
Key Benefits of a Cloud Data Lake
Data Management
SQL Analytics
SQL Analytics via Spark SQL
SQL Analytics via Other Delta Lake Integrations
Data for Data Science and Machine Learning
Challenges with Traditional Machine Learning
Delta Lake Features That Support Machine Learning
Putting It All Together
Medallion Architecture
The Bronze Layer (Raw Data)
The Silver Layer
The Gold Layer
The Complete Lakehouse
Conclusion
Index
About the Author
Colophon