Architecting Data and Machine Learning Platforms: Enable Analytics and AI-Driven Innovation in the Cloud (Final)

All cloud architects need to know how to build data platforms that enable businesses to make data-driven decisions and deliver enterprise-wide intelligence in a fast and efficient way. This handbook shows you how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, and multicloud tools like Snowflake and Databricks.

Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle, from ingestion to activation, in a cloud environment using real-world enterprise architectures. You'll learn how to transform, secure, and modernize familiar solutions like data warehouses and data lakes, and you'll be able to leverage recent AI/ML patterns to get accurate insights more quickly and drive competitive advantage.

You'll learn how to:

  • Design a modern and secure cloud native or hybrid data analytics and machine learning platform
  • Accelerate data-led innovation by...

Author(s): Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 359

    Preface
    Why Do You Need a Cloud Data Platform?
    Who Is This Book For?
    Organization of This Book
    Conventions Used in This Book
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    1. Modernizing Your Data Platform: An Introductory Overview
    The Data Lifecycle
    The Journey to Wisdom
    Water Pipes Analogy
    Collect
    Store
    Scalability
    Performance versus cost
    High availability
    Durability
    Openness
    Process/Transform
    Analyze/Visualize
    Activate
    Limitations of Traditional Approaches
    Antipattern: Breaking Down Silos Through ETL
    Antipattern: Centralization of Control
    Antipattern: Data Marts and Hadoop
    Creating a Unified Analytics Platform
    Cloud Instead of On-Premises
    Drawbacks of Data Marts and Data Lakes
    Convergence of DWHs and Data Lakes
    Lakehouse
    Data mesh
    Hybrid Cloud
    Reasons Why Hybrid Is Necessary
    Challenges of Hybrid Cloud
    Why Hybrid Can Work
    Edge Computing
    Applying AI
    Machine Learning
    Uses of ML
    Why Cloud for AI?
    Cloud Infrastructure
    Democratization
    Real Time
    MLOps
    Core Principles
    Summary
    2. Strategic Steps to Innovate with Data
    Step 1: Strategy and Planning
    Strategic Goals
    Identify Stakeholders
    Change Management
    Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach
    Why Cloud Costs Less
    How Much Are the Savings?
    When Does Cloud Help?
    Step 3: Break Down Silos
    Unifying Data Access
    Choosing Storage
    Semantic Layer
    Step 4: Make Decisions in Context Faster
    Batch to Stream
    Contextual Information
    Cost Management
    Step 5: Leapfrog with Packaged AI Solutions
    Predictive Analytics
    Understanding and Generating Unstructured Data
    Personalization
    Packaged Solutions
    Step 6: Operationalize AI-Driven Workflows
    Identifying the Right Balance of Automation and Assistance
    Building a Data Culture
    Populating Your Data Science Team
    Step 7: Product Management for Data
    Applying Product Management Principles to Data
    1. Understand and Maintain a Map of Data Flows in the Enterprise
    2. Identify Key Metrics
    3. Agreed Criteria, Committed Roadmap, and Visionary Backlog
    4. Build for the Customers You Have
    5. Don’t Shift the Burden of Change Management
    6. Interview Customers to Discover Their Data Needs
    7. Whiteboard and Prototype Extensively
    8. Build Only What Will Be Used Immediately
    9. Standardize Common Entities and KPIs
    10. Provide Self-Service Capabilities in Your Data Platform
    Summary
    3. Designing Your Data Team
    Classifying Data Processing Organizations
    Data Analysis–Driven Organization
    The Vision
    The Personas
    Data analysts
    Business analysts
    Data engineers
    The Technological Framework
    Data Engineering–Driven Organization
    The Vision
    The Personas
    Knowledge
    Responsibilities
    The Technological Framework
    Reference architectures
    Benefits of the reference architecture
    Data Science–Driven Organization
    The Vision
    The Personas
    The Technological Framework
    Summary
    4. A Migration Framework
    Modernize Data Workflows
    Holistic View
    Modernize Workflows
    Transform the Workflow Itself
    A Four-Step Migration Framework
    Prepare and Discover
    Assess and Plan
    Execute
    Landing zone
    Migrate
    Validate
    Optimize
    Estimating the Overall Cost of the Solution
    Audit of the Existing Infrastructure
    Request for Information/Proposal and Quotation
    Proof of Concept/Minimum Viable Product
    Setting Up Security and Data Governance
    Framework
    Artifacts
    Governance over the Life of the Data
    Schema, Pipeline, and Data Migration
    Schema Migration
    Pipeline Migration
    Data Migration
    Planning
    Regional capacity and network to the cloud
    Transfer options
    Migration Stages
    Summary
    5. Architecting a Data Lake
    Data Lake and the Cloud—A Perfect Marriage
    Challenges with On-Premises Data Lakes
    Benefits of Cloud Data Lakes
    Design and Implementation
    Batch and Stream
    Data Catalog
    Hadoop Landscape
    Cloud Data Lake Reference Architecture
    Amazon Web Services
    Microsoft Azure
    Google Cloud Platform
    Integrating the Data Lake: The Real Superpower
    APIs to Extend the Lake
    The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake
    Interactive Analytics with Notebooks
    Democratizing Data Processing and Reporting
    Build Trust in the Data
    Data Ingestion Is Still an IT Matter
    ML in the Data Lake
    Training on Raw Data
    Predicting in the Data Lake
    Summary
    6. Innovating with an Enterprise Data Warehouse
    A Modern Data Platform
    Organizational Goals
    Technological Challenges
    Technology Trends and Tools
    Hub-and-Spoke Architecture
    Data Ingest
    Prebuilt connectors
    Real-time data
    Federated data
    Business Intelligence
    SQL analytics
    Visualization
    Embedded analytics
    Semantic layer
    Transformations
    ELT with views
    Scheduled queries
    Materialized views
    Security and lineage
    Organizational Structure
    DWH to Enable Data Scientists
    Query Interface
    Storage API
    ML Without Moving Your Data
    Training ML models
    ML training and serving
    Exporting trained ML models
    Using your trained model in ML pipelines
    Invoking external ML models
    Loading pretrained ML models
    Summary
    7. Converging to a Lakehouse
    The Need for a Unique Architecture
    User Personas
    Antipattern: Disconnected Systems
    Antipattern: Duplicated Data
    Converged Architecture
    Two Forms
    Choose based on user skills
    Complete evaluation criteria
    Lakehouse on Cloud Storage
    Reference architecture
    Migration
    Future proofing
    SQL-First Lakehouse
    Reference architecture
    Migration
    Future proofing
    The Benefits of Convergence
    Summary
    8. Architectures for Streaming
    The Value of Streaming
    Industry Use Cases
    Streaming Use Cases
    Streaming Ingest
    Streaming ETL
    Streaming ELT
    Streaming Insert
    Streaming from Edge Devices (IoT)
    Streaming Sinks
    Real-Time Dashboards
    Live Querying
    Materialize Some Views
    Stream Analytics
    Time-Series Analytics
    Clickstream Analytics
    Anomaly Detection
    Resilient Streaming
    Continuous Intelligence Through ML
    Training Model on Streaming Data
    Windowed training
    Scheduled training
    Continuous evaluation and retraining
    Streaming ML Inference
    Automated Actions
    Summary
    9. Extending a Data Platform Using Hybrid and Edge
    Why Multicloud?
    A Single Cloud Is Simpler and Cost-Effective
    Multicloud Is Inevitable
    Multicloud Could Be Strategic
    Multicloud Architectural Patterns
    Single Pane of Glass
    Write Once, Run Anywhere
    Bursting from On Premises to Cloud
    Pass-Through from On Premises to Cloud
    Data Integration Through Streaming
    Adopting Multicloud
    Framework
    Time Scale
    Define a Target Multicloud Architecture
    Why Edge Computing?
    Bandwidth, Latency, and Patchy Connectivity
    Use Cases
    Benefits
    Challenges
    Edge Computing Architectural Patterns
    Smart Devices
    Smart Gateways
    ML Activation
    Adopting Edge Computing
    The Initial Context
    The Project
    Improve overall system observability
    Develop automations
    Optimize the maintenance
    The Final Outcomes and Next Steps
    Summary
    10. AI Application Architecture
    Is This an AI/ML Problem?
    Subfields of AI
    Generative AI
    How it works
    Strengths and limitations
    Do LLMs memorize or generalize?
    LLMs hallucinate
    Human feedback is needed
    Weaknesses
    Use cases
    Problems Fit for ML
    Buy, Adapt, or Build?
    Data Considerations
    When to Buy
    What Can You Buy?
    How Adapting Works
    AI Architectures
    Understanding Unstructured Data
    Generating Unstructured Data
    Predicting Outcomes
    Forecasting Values
    Anomaly Detection
    Personalization
    Automation
    Responsible AI
    AI Principles
    ML Fairness
    Explainability
    Summary
    11. Architecting an ML Platform
    ML Activities
    Developing ML Models
    Labeling Environment
    Development Environment
    User Environment
    Preparing Data
    Training ML Models
    Writing ML code
    Small-scale jobs
    Distributed training
    No-code ML
    Deploying ML Models
    Deploying to an Endpoint
    Evaluate Model
    Hybrid and Multicloud
    Training-Serving Skew
    Within the model
    Transform function
    Feature store
    The canonical use of a feature store
    Decision chart
    Automation
    Automate Training and Deployment
    Orchestration with Pipelines
    Managed pipelines
    Airflow
    Kubeflow Pipelines
    TensorFlow Extended
    Continuous Evaluation and Training
    Artifacts
    Dependency tracking
    Continuous evaluation
    Continuous retraining
    Choosing the ML Framework
    Team Skills
    Task Considerations
    User-Centric
    Summary
    12. Data Platform Modernization: A Model Case
    New Technology for a New Era
    The Need for Change
    It Is Not Only a Matter of Technology
    The Beginning of the Journey
    The Current Environment
    The Target Environment
    The PoC Use Case
    The RFP Responses Proposed by Cloud Vendors
    The Target Environment
    The Approach on Migration
    Foundations development
    Quick wins migration
    Migration fulfillment
    Modernization
    The RFP Evaluation Process
    The Scope of the PoC
    The Execution of the PoC
    The Final Decision
    Peroration
    Summary
    Index