With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS.
This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights.
You'll learn how to:
Use modern data management and data engineering techniques
Understand how ACID transactions bring reliability to data lakes at scale
Run streaming and batch jobs against your data lake concurrently
Execute update, delete, and merge commands against your data...
Author(s): Bennie Haelen; Dan Davis
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 264
Preface
How to Contact Us
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
Acknowledgment
1. The Evolution of Data Architectures
A Brief History of Relational Databases
Data Warehouses
Data Warehouse Architecture
Dimensional Modeling
Data Warehouse Benefits and Challenges
Introducing Data Lakes
Data Lakehouse
Data Lakehouse Benefits
Implementing a Lakehouse
Delta Lake
The Medallion Architecture
The Delta Ecosystem
Delta Lake Storage
Delta Sharing
Delta Connectors
Conclusion
2. Getting Started with Delta Lake
Getting a Standard Spark Image
Using Delta Lake with PySpark
Running Delta Lake in the Spark Scala Shell
Running Delta Lake on Databricks
Creating and Running a Spark Program: helloDeltaLake
The Delta Lake Format
Parquet Files
Advantages of Parquet files
Writing a Parquet file
Writing a Delta Table
The Delta Lake Transaction Log
How the Transaction Log Implements Atomicity
Breaking Down Transactions into Atomic Commits
The Transaction Log at the File Level
Write multiple writes to the same file
Reading the latest version of a Delta table
Failure scenario with a write operation
Update scenario
Scaling Massive Metadata
Checkpoint file example
Displaying the checkpoint file
Conclusion
3. Basic Operations on Delta Tables
Creating a Delta Table
Creating a Delta Table with SQL DDL
The DESCRIBE Statement
Creating Delta Tables with the DataFrameWriter API
Creating a managed table
Creating an unmanaged table
Creating a Delta Table with the DeltaTableBuilder API
Generated Columns
Reading a Delta Table
Reading a Delta Table with SQL
Reading a Table with PySpark
Writing to a Delta Table
Cleaning Out the YellowTaxis Table
Inserting Data with SQL INSERT
Appending a DataFrame to a Table
Using the OverWrite Mode When Writing to a Delta Table
Inserting Data with the SQL COPY INTO Command
Partitions
Partitioning by a single column
Partitioning by multiple columns
Checking if a partition exists
Selectively updating Delta partitions with replaceWhere
User-Defined Metadata
Using SparkSession to Set Custom Metadata
Using the DataFrameWriter to Set Custom Metadata
Conclusion
4. Table Deletes, Updates, and Merges
Deleting Data from a Delta Table
Table Creation and DESCRIBE HISTORY
Performing the DELETE Operation
DELETE Performance Tuning Tips
Updating Data in a Table
Use Case Description
Updating Data in a Table
UPDATE Performance Tuning Tips
Upsert Data Using the MERGE Operation
Use Case Description
The MERGE Dataset
The MERGE Statement
Modifying unmatched rows using MERGE
Analyzing the MERGE operation with DESCRIBE HISTORY
Inner Workings of the MERGE Operation
Conclusion
5. Performance Tuning
Data Skipping
Partitioning
Partitioning Warnings and Considerations
Compact Files
Compaction
OPTIMIZE
OPTIMIZE considerations
ZORDER BY
ZORDER BY Considerations
Liquid Clustering
Enabling Liquid Clustering
Operations on Clustered Columns
Changing clustered columns
Viewing clustered columns
Removing clustered columns
Liquid Clustering Warnings and Considerations
Conclusion
6. Using Time Travel
Delta Lake Time Travel
Restoring a Table
Restoring via Timestamp
Time Travel Under the Hood
RESTORE Considerations and Warnings
Querying an Older Version of a Table
Data Retention
Data File Retention
Log File Retention
Setting File Retention Duration Example
Data Archiving
VACUUM
VACUUM Syntax and Examples
How Often Should You Run VACUUM and Other Maintenance Tasks?
VACUUM Warnings and Considerations
Changing Data Feed
Enabling the CDF
Viewing the CDF
CDF Warnings and Considerations
Conclusion
7. Schema Handling
Schema Validation
Viewing the Schema in the Transaction Log Entries
Schema on Write
Schema Enforcement Example
Matching schema
Schema with an additional column
Schema Evolution
Adding a Column
Missing Data Column in Source DataFrame
Changing a Column Data Type
Adding a NullType Column
Explicit Schema Updates
Adding a Column to a Table
Adding Comments to a Column
Changing Column Ordering
Delta Lake Column Mapping
Renaming a Column
Replacing the Table Columns
Dropping a Column
The REORG TABLE Command
Changing Column Data Type or Name
Conclusion
8. Operations on Streaming Data
Streaming Overview
Spark Structured Streaming
Delta Lake and Structured Streaming
Streaming Examples
Hello Streaming World
Creating the streaming query
The query process log
The checkpoint file
AvailableNow Streaming
Updating the Source Records
The StreamingQuery class
Reprocessing all or part of the source records
Reading a Stream from the Change Data Feed
Conclusion
9. Delta Sharing
Conventional Methods of Data Sharing
Legacy and Homegrown Solutions
Proprietary Vendor Solutions
Cloud Object Storage
Open Source Delta Sharing
Delta Sharing Goals
Delta Sharing Under the Hood
Data Providers and Recipients
Benefits of the Design
The delta-sharing Repository
Step 1: Installing the Python Connector
Step 2: Installing the Profile File
Step 3: Reading a Shared Table
Conclusion
10. Building a Lakehouse on Delta Lake
Storage Layer
What Is a Data Lake?
Types of Data
Key Benefits of a Cloud Data Lake
Data Management
SQL Analytics
SQL Analytics via Spark SQL
SQL Analytics via Other Delta Lake Integrations
Data for Data Science and Machine Learning
Challenges with Traditional Machine Learning
Delta Lake Features That Support Machine Learning
Putting It All Together
Medallion Architecture
The Bronze Layer (Raw Data)
The Silver Layer
The Gold Layer
The Complete Lakehouse
Conclusion
Index