The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.

This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

  • Learn the benefits of a cloud-based big data strategy for your organization
  • Get guidance and best practices for designing performant and scalable data lakes
  • Examine architecture and design choices,...

Author(s): Rukmani Gopalan
Publisher: O'Reilly Media
Year: 2022
Language: English
Pages: 244

    Preface
    Why I Wrote This Book
    Who Should Read This Book?
    Introducing Klodars Corporation
    Navigating the Book
    Conventions Used in This Book
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    1. Big Data—Beyond the Buzz
    What Is Big Data?
    Elastic Data Infrastructure—The Challenge
    Cloud Computing Fundamentals
    Cloud Computing Terminology
    Value Proposition of the Cloud
    Cloud Data Lake Architecture
    Limitations of On-Premises Data Warehouse Solutions
    What Is a Cloud Data Lake Architecture?
    Benefits of a Cloud Data Lake Architecture
    Defining Your Cloud Data Lake Journey
    Summary
    2. Big Data Architectures on the Cloud
    Why Klodars Corporation Moves to the Cloud
    Fundamentals of Cloud Data Lake Architectures
    A Word on Variety of Data
    Cloud Data Lake Storage
    Big Data Analytics Engines
    MapReduce
    Apache Hadoop
    Apache Spark
    Real-time stream processing pipelines
    Cloud Data Warehouses
    Modern Data Warehouse Architecture
    Reference Architecture
    Sample Use Case for a Modern Data Warehouse Architecture
    Benefits and Challenges of Modern Data Warehouse Architecture
    Data Lakehouse Architecture
    Reference Architecture for the Data Lakehouse
    Data formats
    Metadata
    Compute engines
    Sample Use Case for Data Lakehouse Architecture
    Benefits and Challenges of the Data Lakehouse Architecture
    Data Warehouses and Unstructured Data
    Data Mesh
    Reference Architecture
    Sample Use Case for a Data Mesh Architecture
    Challenges and Benefits of a Data Mesh Architecture
    What Is the Right Architecture for Me?
    Know Your Customers
    Know Your Business Drivers
    Consider Your Growth and Future Scenarios
    Design Considerations
    Hybrid Approaches
    Summary
    3. Design Considerations for Your Data Lake
    Setting Up the Cloud Data Lake Infrastructure
    Identify Your Goals
    How Klodars Corporation defined the data lake goals
    Plan Your Architecture and Deliverables
    How Klodars Corporation planned their architecture and deliverables
    Implement the Cloud Data Lake
    Release and Operationalize
    Organizing Data in Your Data Lake
    A Day in the Life of Data
    Data Lake Zones
    Organization Mechanisms
    Introduction to Data Governance
    Actors Involved in Data Governance
    Data Classification
    Metadata Management, Data Catalog, and Data Sharing
    Data Access Management
    Data Quality and Observability
    Data Governance at Klodars Corporation
    Data Governance Wrap-Up
    Manage Data Lake Costs
    Demystifying Data Lake Costs on the Cloud
    Data Lake Cost Strategy
    Data Lake Environments and Associated Costs
    Cost strategy based on data
    Transactions and impact on costs
    Summary
    4. Scalable Data Lakes
    A Sneak Peek into Scalability
    What Is Scalability?
    Scale in Our Day-to-Day Life
    Scalability in Data Lake Architectures
    Internals of Data Lake Processing Systems
    Data Copy Internals
    Components of a data copy solution
    Understanding resource utilization of a data copy job
    ELT/ETL Processing Internals
    Components of an Apache Spark application
    Understanding resource utilization of a Spark job
    A Note on Other Interactive Queries
    Considerations for Scalable Data Lake Solutions
    Pick the Right Cloud Offerings
    Hybrid and multicloud solutions
    IaaS versus PaaS versus SaaS solutions
    Cloud offerings for Klodars Corporation
    Plan for Peak Capacity
    Data Formats and Job Profile
    Summary
    5. Optimizing Cloud Data Lake Architectures for Performance
    Basics of Measuring Performance
    Goals and Metrics for Performance
    Measuring Performance
    Optimizing for Faster Performance
    Cloud Data Lake Performance
    SLAs, SLOs, and SLIs
    Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
    Drivers of Performance
    Performance Drivers for a Copy Job
    Performance Drivers for a Spark Job
    Optimization Principles and Techniques for Performance Tuning
    Data Formats
    Exploring Apache Parquet
    Other popular data formats
    How Klodars Corporation picked their data formats
    Data Organization and Partitioning
    Optimal data organization strategy for Klodars Corporation
    Choosing the Right Configurations on Apache Spark
    Minimize Overheads with Data Transfer
    Premium Offerings and Performance
    The Case of Bigger Virtual Machines
    The Case of Flash Storage
    Summary
    6. Deep Dive on Data Formats
    Why Do We Need These Open Data Formats?
    Why Do We Need to Store Tabular Data?
    Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
    Delta Lake
    Why Was Delta Lake Founded?
    Eliminate data silos across business analysts, data scientists, and data engineers
    Provide a unified data and computational system for batch and real-time streaming data
    Support bulk updates or changes to existing data
    Handle errors due to schema changes and incorrect data
    How Does Delta Lake Work?
    When Do You Use Delta Lake?
    Apache Iceberg
    Why Was Apache Iceberg Founded?
    How Does Apache Iceberg Work?
    When Do You Use Apache Iceberg?
    Apache Hudi
    Why Was Apache Hudi Founded?
    How Does Apache Hudi Work?
    Copy-on-write tables
    Merge-on-read tables
    When Do You Use Apache Hudi?
    Summary
    7. Decision Framework for Your Architecture
    Cloud Data Lake Assessment
    Cloud Data Lake Assessment Questionnaire
    Analysis for Your Cloud Data Lake Assessment
    Starting from Scratch
    Migrating an Existing Data Lake or Data Warehouse to the Cloud
    Improving an Existing Cloud Data Lake
    Phase 1 of Decision Framework: Assess
    Understand Customer Requirements
    Understand Opportunities for Improvement
    Know Your Business Drivers
    Complete the Assess Phase by Prioritizing the Requirements
    Phase 2 of Decision Framework: Define
    Finalize the Design Choices for the Cloud Data Lake
    Picking your architecture
    Picking your cloud provider
    Decision points for data lake migrations
    Plan Your Cloud Data Lake Project Deliverables
    Phase 3 of Decision Framework: Implement
    Phase 4 of Decision Framework: Operationalize
    Summary
    8. Six Lessons for a Data Informed Future
    Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
    Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception
    Lesson 3: Customers Lead Technology, Not the Other Way Around
    Lesson 4: Change Is Inevitable, so Be Prepared
    Lesson 5: Build Empathy and Prioritize Ruthlessly
    Lesson 6: Big Impact Does Not Happen Overnight
    Summary
    A. Cloud Data Lake Decision Framework Template
    Phase 1: Assess Framework
    Phase 2: Define Framework
    Planning the Cloud Data Lake Deliverables
    Phase 3: Implement Framework
    Index