Simplifying Data Engineering and Analytics with Delta: Create analytics-ready data that fuels artificial intelligence and business intelligence, by Anindita Mahapatra

Who this book is for: Data engineers, data scientists, ML practitioners, BI analysts, or anyone in the data domain working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.

Author(s): Anindita Mahapatra
Year: 2022

Language: English

Cover
Title Page
Copyright and Credits
Foreword
Contributors
Table of Contents
Preface
Section 1 – Introduction to Delta Lake and Data Engineering Principles
Chapter 1: Introduction to Data Engineering
The motivation behind data engineering
Use cases
How big is big data?
But isn't ML and AI all the rage today?
Understanding the role of data personas
Big data ecosystem
What characterizes big data?
Classifying data
Reaping value from data
Top challenges of big data systems
Evolution of data systems
Rise of cloud data platforms
SQL and NoSQL systems
OLTP and OLAP systems
Distributed computing
SMP and MPP computing
Parallel and distributed computing
Business justification for tech spending
Strategy for business transformation to use data as an asset
Big data trends and best practices
Summary
Chapter 2: Data Modeling and ETL
Technical requirements
What is data modeling and why should you care?
Advantages of a data modeling exercise
Stages of data modeling
Data modeling approaches for different data stores
Understanding metadata – data about data
Data catalog
Types of metadata
Why is metadata management the nerve center of data?
Moving and transforming data using ETL
Scenarios to consider for building ETL pipelines
Job orchestration
How to choose the right data format
Text format versus binary format
Row versus column formats
When to use which format
Leveraging data compression
Common big data design patterns
Ingestion
Transformations
Persist
Summary
Further reading
Chapter 3: Delta – The Foundation Block for Big Data
Technical requirements
Motivation for Delta
A case of too many is too little
Data silos to data swamps
Characteristics of curated data lakes
DDL commands
DML commands
APPEND
Demystifying Delta
Format layout on disk
The main features of Delta
ACID transaction support
Schema evolution
Unifying batch and streaming workloads
Time travel
Performance
Life with and without Delta
Lakehouse
Summary
Section 2 – End-to-End Process of Building Delta Pipelines
Chapter 4: Unifying Batch and Streaming with Delta
Technical requirements
Moving toward real-time systems
Streaming concepts
Lambda versus Kappa architectures
Streaming ETL
Extract – file-based versus event-based streaming
Transforming – stream processing
Loading – persisting the stream
Handling streaming scenarios
Joining with other static and dynamic datasets
Recovering from failures
Handling late-arriving data
Stateless and stateful in-stream operations
Trade-offs in designing streaming architectures
Cost trade-offs
Handling latency trade-offs
Data reprocessing
Multi-tenancy
De-duplication
Streaming best practices
Summary
Chapter 5: Data Consolidation in Delta Lake
Technical requirements
Why consolidate disparate data types?
Delta unifies all types of data
Structured data
Semi-structured data
Unstructured data
Avoiding patches of data darkness
Addressing problems in flight status using Delta
Augmenting domain knowledge constraints to quality
Continuous quality monitoring
Curating data in stages for analytics
RDD, DataFrames, and datasets
Spark transformations and actions
Spark APIs and UDFs
Ease of extending to existing and new use cases
Delta Lake connectors
Specialized Delta Lakes by industry
Data governance
GDPR and CCPA compliance
Role-based data access
Summary
Chapter 6: Solving Common Data Pattern Scenarios with Delta
Technical requirements
Understanding use case requirements
Minimizing data movement with Delta time travel
Delta cloning
Handling CDC
CDC
Change Data Feed (CDF)
Handling Slowly Changing Dimensions (SCD)
SCD Type 1
SCD Type 2
Summary
Chapter 7: Delta for Data Warehouse Use Cases
Technical requirements
Choosing the right architecture
Understanding what a data warehouse really solves
Lacunas of data warehouses
Discovering when a data lake does not suffice
Addressing concurrency and latency requirements with Delta
Visualizing data using BI reporting
Can cubes be constructed with Delta?
Analyzing tradeoffs in a push versus pull data flow
Why is being open such a big deal?
Considerations around data governance
The rise of the lakehouse category
Summary
Chapter 8: Handling Atypical Data Scenarios with Delta
Technical requirements
Emphasizing the importance of exploratory data analysis (EDA)
From big data to good data
Data profiling
Statistical analysis
Applying sampling techniques to address class imbalance
How to detect and address imbalance
Synthetic data generation to deal with data imbalance
Addressing data skew
Providing data anonymity
Handling bias and variance in data
Bias versus variance
How do we detect bias and variance?
How do we fix bias and variance?
Compensating for missing and out-of-range data
Monitoring data drift
Summary
Chapter 9: Delta for Reproducible Machine Learning Pipelines
Technical requirements
Data science versus machine learning
Challenges of ML development
Formalizing the ML development process
What is a model?
What is MLOps?
Aspirations of a modern ML platform
The role of Delta in an ML pipeline
Delta-backed feature store
Delta-backed model training
Delta-backed model inferencing
Model monitoring with Delta
From business problem to insight generation
Summary
Chapter 10: Delta for Data Products and Services
Technical requirements
DaaS
The need for data democratization
Delta for unstructured data
NLP data (text and audio)
Image and video data
Data mashups using Delta
Data blending
Data harmonization
Federated query
Facilitating data sharing with Delta
Setting up Delta sharing
Benefits of Delta sharing
Data clean room
Summary
Section 3 – Operationalizing and Productionalizing Delta Pipelines
Chapter 11: Operationalizing Data and ML Pipelines
Technical requirements
Why operationalize?
Understanding and monitoring SLAs
Scaling and high availability
Planning for DR
How to decide on the correct DR strategy
How Delta helps with DR
Guaranteeing data quality
Automation of CI/CD pipelines
Code under version control
Infrastructure as Code (IaC)
Unit and integration testing
Data as code – An intelligent pipeline
Summary
Chapter 12: Optimizing Cost and Performance with Delta
Technical requirements
Improving performance with common strategies
Where to look and what to look for
Optimizing with Delta
Changing the data layout in storage
Other platform optimizations
Automation
Is cost always inversely proportional to performance?
Best practices for managing performance
Summary
Chapter 13: Managing Your Data Journey
Provisioning a multi-tenant infrastructure
Data democratization via policies and processes
Capacity planning
Managing and monitoring
Data sharing
Data migration
COE best practices
Summary
Index
Other Books You May Enjoy