More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.
This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
Learn the benefits of a cloud-based big data strategy for your organization
Get guidance and best practices for designing performant and scalable data lakes
Examine architecture and design choices,...
Author(s): Rukmani Gopalan
Publisher: O'Reilly Media
Why I Wrote This Book
Who Should Read This Book?
Introducing Klodars Corporation
Navigating the Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
1. Big Data—Beyond the Buzz
What Is Big Data?
Elastic Data Infrastructure—The Challenge
Cloud Computing Fundamentals
Cloud Computing Terminology
Value Proposition of the Cloud
Cloud Data Lake Architecture
Limitations of On-Premises Data Warehouse Solutions
What Is a Cloud Data Lake Architecture?
Benefits of a Cloud Data Lake Architecture
Defining Your Cloud Data Lake Journey
2. Big Data Architectures on the Cloud
Why Klodars Corporation Moves to the Cloud
Fundamentals of Cloud Data Lake Architectures
A Word on Variety of Data
Cloud Data Lake Storage
Big Data Analytics Engines
Real-time stream processing pipelines
Cloud Data Warehouses
Modern Data Warehouse Architecture
Sample Use Case for a Modern Data Warehouse Architecture
Benefits and Challenges of Modern Data Warehouse Architecture
Data Lakehouse Architecture
Reference Architecture for the Data Lakehouse
Sample Use Case for Data Lakehouse Architecture
Benefits and Challenges of the Data Lakehouse Architecture
Data Warehouses and Unstructured Data
Sample Use Case for a Data Mesh Architecture
Challenges and Benefits of a Data Mesh Architecture
What Is the Right Architecture for Me?
Know Your Customers
Know Your Business Drivers
Consider Your Growth and Future Scenarios
3. Design Considerations for Your Data Lake
Setting Up the Cloud Data Lake Infrastructure
Identify Your Goals
How Klodars Corporation defined the data lake goals
Plan Your Architecture and Deliverables
How Klodars Corporation planned their architecture and deliverables
Implement the Cloud Data Lake
Release and Operationalize
Organizing Data in Your Data Lake
A Day in the Life of Data
Data Lake Zones
Introduction to Data Governance
Actors Involved in Data Governance
Metadata Management, Data Catalog, and Data Sharing
Data Access Management
Data Quality and Observability
Data Governance at Klodars Corporation
Data Governance Wrap-Up
Manage Data Lake Costs
Demystifying Data Lake Costs on the Cloud
Data Lake Cost Strategy
Data Lake Environments and Associated Costs
Cost strategy based on data
Transactions and impact on costs
4. Scalable Data Lakes
A Sneak Peek into Scalability
What Is Scalability?
Scale in Our Day-to-Day Life
Scalability in Data Lake Architectures
Internals of Data Lake Processing Systems
Data Copy Internals
Components of a data copy solution
Understanding resource utilization of a data copy job
ELT/ETL Processing Internals
Components of an Apache Spark application
Understanding resource utilization of a Spark job
A Note on Other Interactive Queries
Considerations for Scalable Data Lake Solutions
Pick the Right Cloud Offerings
Hybrid and multicloud solutions
IaaS versus PaaS versus SaaS solutions
Cloud offerings for Klodars Corporation
Plan for Peak Capacity
Data Formats and Job Profile
5. Optimizing Cloud Data Lake Architectures for Performance
Basics of Measuring Performance
Goals and Metrics for Performance
Optimizing for Faster Performance
Cloud Data Lake Performance
SLAs, SLOs, and SLIs
Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
Drivers of Performance
Performance Drivers for a Copy Job
Performance Drivers for a Spark Job
Optimization Principles and Techniques for Performance Tuning
Exploring Apache Parquet
Other popular data formats
How Klodars Corporation picked their data formats
Data Organization and Partitioning
Optimal data organization strategy for Klodars Corporation
Choosing the Right Configurations on Apache Spark
Minimize Overheads with Data Transfer
Premium Offerings and Performance
The Case of Bigger Virtual Machines
The Case of Flash Storage
6. Deep Dive on Data Formats
Why Do We Need These Open Data Formats?
Why Do We Need to Store Tabular Data?
Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
Why Was Delta Lake Founded?
Eliminate data silos across business analysts, data scientists, and data engineers
Provide a unified data and computational system for batch and real-time streaming data
Support bulk updates or changes to existing data
Handle errors due to schema changes and incorrect data
How Does Delta Lake Work?
When Do You Use Delta Lake?
Why Was Apache Iceberg Founded?
How Does Apache Iceberg Work?
When Do You Use Apache Iceberg?
Why Was Apache Hudi Founded?
How Does Apache Hudi Work?
When Do You Use Apache Hudi?
7. Decision Framework for Your Architecture
Cloud Data Lake Assessment
Cloud Data Lake Assessment Questionnaire
Analysis for Your Cloud Data Lake Assessment
Starting from Scratch
Migrating an Existing Data Lake or Data Warehouse to the Cloud
Improving an Existing Cloud Data Lake
Phase 1 of Decision Framework: Assess
Understand Customer Requirements
Understand Opportunities for Improvement
Know Your Business Drivers
Complete the Assess Phase by Prioritizing the Requirements
Phase 2 of Decision Framework: Define
Finalize the Design Choices for the Cloud Data Lake
Picking your architecture
Picking your cloud provider
Decision points for data lake migrations
Plan Your Cloud Data Lake Project Deliverables
Phase 3 of Decision Framework: Implement
Phase 4 of Decision Framework: Operationalize
8. Six Lessons for a Data Informed Future
Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception
Lesson 3: Customers Lead Technology, Not the Other Way Around
Lesson 4: Change Is Inevitable, so Be Prepared
Lesson 5: Build Empathy and Prioritize Ruthlessly
Lesson 6: Big Impact Does Not Happen Overnight
A. Cloud Data Lake Decision Framework Template
Phase 1: Assess Framework
Phase 2: Define Framework
Planning the Cloud Data Lake Deliverables
Phase 3: Implement Framework