Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS.

This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data, and you'll see how your choice of storage solution determines the robustness and performance of the pipeline, from raw data to insights.

You'll learn how to:

  • Use modern data management and data engineering techniques
  • Understand how ACID transactions bring reliability to data lakes at scale
  • Run streaming and batch jobs against your data lake concurrently
  • Execute update, delete, and merge commands against your data...
Author(s): Bennie Haelen; Dan Davis
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 264
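The "ACID transactions" and merge bullets above rest on Delta Lake's write-ahead transaction log, which the book covers in depth. As a rough, self-contained illustration of the idea — plain Python with no Spark, using hypothetical helper names rather than Delta Lake's actual implementation — each commit is a numbered JSON file of add/remove actions that becomes visible atomically:

```python
import json
import os
import tempfile

# Toy sketch of a Delta-style transaction log: each commit is one JSON file
# named by version (00000000000000000000.json, ...). A commit becomes visible
# atomically when its file appears; readers replay add/remove actions in
# version order to reconstruct the current set of data files.

def commit(log_dir, version, actions):
    # Write to a temp file first, then rename: the rename is the atomic
    # "publish" step, so readers never observe a half-written commit.
    path = os.path.join(log_dir, f"{version:020d}.json")
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    os.rename(tmp, path)

def current_files(log_dir):
    # Replay all commits in version order; later commits win.
    live = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return live

log = tempfile.mkdtemp()
commit(log, 0, [{"add": "part-0000.parquet"}])
commit(log, 1, [{"add": "part-0001.parquet"}])
# An "update" rewrites a file: remove the old one and add the new one in a
# single commit, so readers see either both changes or neither.
commit(log, 2, [{"remove": "part-0000.parquet"}, {"add": "part-0002.parquet"}])
print(sorted(current_files(log)))  # ['part-0001.parquet', 'part-0002.parquet']
```

A real Delta table keeps its commit files in a `_delta_log/` directory next to the Parquet data files; Chapter 2 of the book walks through the actual log entries and checkpoint files.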

Table of Contents

Preface
  How to Contact Us
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  Acknowledgment

1. The Evolution of Data Architectures
  A Brief History of Relational Databases
  Data Warehouses
  Data Warehouse Architecture
  Dimensional Modeling
  Data Warehouse Benefits and Challenges
  Introducing Data Lakes
  Data Lakehouse
  Data Lakehouse Benefits
  Implementing a Lakehouse
  Delta Lake
  The Medallion Architecture
  The Delta Ecosystem
  Delta Lake Storage
  Delta Sharing
  Delta Connectors
  Conclusion

2. Getting Started with Delta Lake
  Getting a Standard Spark Image
  Using Delta Lake with PySpark
  Running Delta Lake in the Spark Scala Shell
  Running Delta Lake on Databricks
  Creating and Running a Spark Program: helloDeltaLake
  The Delta Lake Format
  Parquet Files
    Advantages of Parquet files
    Writing a Parquet file
  Writing a Delta Table
  The Delta Lake Transaction Log
  How the Transaction Log Implements Atomicity
  Breaking Down Transactions into Atomic Commits
  The Transaction Log at the File Level
    Write multiple writes to the same file
    Reading the latest version of a Delta table
    Failure scenario with a write operation
    Update scenario
  Scaling Massive Metadata
    Checkpoint file example
    Displaying the checkpoint file
  Conclusion

3. Basic Operations on Delta Tables
  Creating a Delta Table
  Creating a Delta Table with SQL DDL
  The DESCRIBE Statement
  Creating Delta Tables with the DataFrameWriter API
    Creating a managed table
    Creating an unmanaged table
  Creating a Delta Table with the DeltaTableBuilder API
  Generated Columns
  Reading a Delta Table
  Reading a Delta Table with SQL
  Reading a Table with PySpark
  Writing to a Delta Table
  Cleaning Out the YellowTaxis Table
  Inserting Data with SQL INSERT
  Appending a DataFrame to a Table
  Using the OverWrite Mode When Writing to a Delta Table
  Inserting Data with the SQL COPY INTO Command
  Partitions
    Partitioning by a single column
    Partitioning by multiple columns
    Checking if a partition exists
    Selectively updating Delta partitions with replaceWhere
  User-Defined Metadata
  Using SparkSession to Set Custom Metadata
  Using the DataFrameWriter to Set Custom Metadata
  Conclusion

4. Table Deletes, Updates, and Merges
  Deleting Data from a Delta Table
  Table Creation and DESCRIBE HISTORY
  Performing the DELETE Operation
  DELETE Performance Tuning Tips
  Updating Data in a Table
  Use Case Description
  Updating Data in a Table
  UPDATE Performance Tuning Tips
  Upsert Data Using the MERGE Operation
  Use Case Description
  The MERGE Dataset
  The MERGE Statement
    Modifying unmatched rows using MERGE
    Analyzing the MERGE operation with DESCRIBE HISTORY
  Inner Workings of the MERGE Operation
  Conclusion

5. Performance Tuning
  Data Skipping
  Partitioning
  Partitioning Warnings and Considerations
  Compact Files
  Compaction
  OPTIMIZE
    OPTIMIZE considerations
  ZORDER BY
  ZORDER BY Considerations
  Liquid Clustering
  Enabling Liquid Clustering
  Operations on Clustered Columns
    Changing clustered columns
    Viewing clustered columns
    Removing clustered columns
  Liquid Clustering Warnings and Considerations
  Conclusion

6. Using Time Travel
  Delta Lake Time Travel
  Restoring a Table
  Restoring via Timestamp
  Time Travel Under the Hood
  RESTORE Considerations and Warnings
  Querying an Older Version of a Table
  Data Retention
  Data File Retention
  Log File Retention
  Setting File Retention Duration Example
  Data Archiving
  VACUUM
  VACUUM Syntax and Examples
  How Often Should You Run VACUUM and Other Maintenance Tasks?
  VACUUM Warnings and Considerations
  Change Data Feed
  Enabling the CDF
  Viewing the CDF
  CDF Warnings and Considerations
  Conclusion

7. Schema Handling
  Schema Validation
  Viewing the Schema in the Transaction Log Entries
  Schema on Write
  Schema Enforcement Example
    Matching schema
    Schema with an additional column
  Schema Evolution
  Adding a Column
  Missing Data Column in Source DataFrame
  Changing a Column Data Type
  Adding a NullType Column
  Explicit Schema Updates
  Adding a Column to a Table
  Adding Comments to a Column
  Changing Column Ordering
  Delta Lake Column Mapping
  Renaming a Column
  Replacing the Table Columns
  Dropping a Column
  The REORG TABLE Command
  Changing Column Data Type or Name
  Conclusion

8. Operations on Streaming Data
  Streaming Overview
  Spark Structured Streaming
  Delta Lake and Structured Streaming
  Streaming Examples
  Hello Streaming World
    Creating the streaming query
    The query process log
    The checkpoint file
  AvailableNow Streaming
  Updating the Source Records
    The StreamingQuery class
    Reprocessing all or part of the source records
  Reading a Stream from the Change Data Feed
  Conclusion

9. Delta Sharing
  Conventional Methods of Data Sharing
  Legacy and Homegrown Solutions
  Proprietary Vendor Solutions
  Cloud Object Storage
  Open Source Delta Sharing
  Delta Sharing Goals
  Delta Sharing Under the Hood
  Data Providers and Recipients
  Benefits of the Design
  The delta-sharing Repository
  Step 1: Installing the Python Connector
  Step 2: Installing the Profile File
  Step 3: Reading a Shared Table
  Conclusion

10. Building a Lakehouse on Delta Lake
  Storage Layer
  What Is a Data Lake?
  Types of Data
  Key Benefits of a Cloud Data Lake
  Data Management
  SQL Analytics
  SQL Analytics via Spark SQL
  SQL Analytics via Other Delta Lake Integrations
  Data for Data Science and Machine Learning
  Challenges with Traditional Machine Learning
  Delta Lake Features That Support Machine Learning
  Putting It All Together
  Medallion Architecture
  The Bronze Layer (Raw Data)
  The Silver Layer
  The Gold Layer
  The Complete Lakehouse
  Conclusion

Index