All cloud architects need to know how to build data platforms that enable businesses to make data-driven decisions and deliver enterprise-wide intelligence in a fast and efficient way. This handbook shows you how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, and multicloud tools like Snowflake and Databricks.
Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle from ingestion to activation in a cloud environment using real-world enterprise architectures. You'll learn how to transform, secure, and modernize familiar solutions like data warehouses and data lakes, and you'll be able to leverage recent AI/ML patterns to get faster, more accurate insights that drive competitive advantage.
You'll learn how to:
Design a modern and secure cloud native or hybrid data analytics and machine learning platform
Accelerate data-led innovation by...
Author(s): Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 359
Preface
Why Do You Need a Cloud Data Platform?
Who Is This Book For?
Organization of This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Modernizing Your Data Platform: An Introductory Overview
The Data Lifecycle
The Journey to Wisdom
Water Pipes Analogy
Collect
Store
Scalability
Performance versus cost
High availability
Durability
Openness
Process/Transform
Analyze/Visualize
Activate
Limitations of Traditional Approaches
Antipattern: Breaking Down Silos Through ETL
Antipattern: Centralization of Control
Antipattern: Data Marts and Hadoop
Creating a Unified Analytics Platform
Cloud Instead of On-Premises
Drawbacks of Data Marts and Data Lakes
Convergence of DWHs and Data Lakes
Lakehouse
Data mesh
Hybrid Cloud
Reasons Why Hybrid Is Necessary
Challenges of Hybrid Cloud
Why Hybrid Can Work
Edge Computing
Applying AI
Machine Learning
Uses of ML
Why Cloud for AI?
Cloud Infrastructure
Democratization
Real Time
MLOps
Core Principles
Summary
2. Strategic Steps to Innovate with Data
Step 1: Strategy and Planning
Strategic Goals
Identify Stakeholders
Change Management
Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach
Why Cloud Costs Less
How Much Are the Savings?
When Does Cloud Help?
Step 3: Break Down Silos
Unifying Data Access
Choosing Storage
Semantic Layer
Step 4: Make Decisions in Context Faster
Batch to Stream
Contextual Information
Cost Management
Step 5: Leapfrog with Packaged AI Solutions
Predictive Analytics
Understanding and Generating Unstructured Data
Personalization
Packaged Solutions
Step 6: Operationalize AI-Driven Workflows
Identifying the Right Balance of Automation and Assistance
Building a Data Culture
Populating Your Data Science Team
Step 7: Product Management for Data
Applying Product Management Principles to Data
1. Understand and Maintain a Map of Data Flows in the Enterprise
2. Identify Key Metrics
3. Agreed Criteria, Committed Roadmap, and Visionary Backlog
4. Build for the Customers You Have
5. Don’t Shift the Burden of Change Management
6. Interview Customers to Discover Their Data Needs
7. Whiteboard and Prototype Extensively
8. Build Only What Will Be Used Immediately
9. Standardize Common Entities and KPIs
10. Provide Self-Service Capabilities in Your Data Platform
Summary
3. Designing Your Data Team
Classifying Data Processing Organizations
Data Analysis–Driven Organization
The Vision
The Personas
Data analysts
Business analysts
Data engineers
The Technological Framework
Data Engineering–Driven Organization
The Vision
The Personas
Knowledge
Responsibilities
The Technological Framework
Reference architectures
Benefits of the reference architecture
Data Science–Driven Organization
The Vision
The Personas
The Technological Framework
Summary
4. A Migration Framework
Modernize Data Workflows
Holistic View
Modernize Workflows
Transform the Workflow Itself
A Four-Step Migration Framework
Prepare and Discover
Assess and Plan
Execute
Landing zone
Migrate
Validate
Optimize
Estimating the Overall Cost of the Solution
Audit of the Existing Infrastructure
Request for Information/Proposal and Quotation
Proof of Concept/Minimum Viable Product
Setting Up Security and Data Governance
Framework
Artifacts
Governance over the Life of the Data
Schema, Pipeline, and Data Migration
Schema Migration
Pipeline Migration
Data Migration
Planning
Regional capacity and network to the cloud
Transfer options
Migration Stages
Summary
5. Architecting a Data Lake
Data Lake and the Cloud—A Perfect Marriage
Challenges with On-Premises Data Lakes
Benefits of Cloud Data Lakes
Design and Implementation
Batch and Stream
Data Catalog
Hadoop Landscape
Cloud Data Lake Reference Architecture
Amazon Web Services
Microsoft Azure
Google Cloud Platform
Integrating the Data Lake: The Real Superpower
APIs to Extend the Lake
The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake
Interactive Analytics with Notebooks
Democratizing Data Processing and Reporting
Build Trust in the Data
Data Ingestion Is Still an IT Matter
ML in the Data Lake
Training on Raw Data
Predicting in the Data Lake
Summary
6. Innovating with an Enterprise Data Warehouse
A Modern Data Platform
Organizational Goals
Technological Challenges
Technology Trends and Tools
Hub-and-Spoke Architecture
Data Ingest
Prebuilt connectors
Real-time data
Federated data
Business Intelligence
SQL analytics
Visualization
Embedded analytics
Semantic layer
Transformations
ELT with views
Scheduled queries
Materialized views
Security and lineage
Organizational Structure
DWH to Enable Data Scientists
Query Interface
Storage API
ML Without Moving Your Data
Training ML models
ML training and serving
Exporting trained ML models
Using your trained model in ML pipelines
Invoking external ML models
Loading pretrained ML models
Summary
7. Converging to a Lakehouse
The Need for a Unique Architecture
User Personas
Antipattern: Disconnected Systems
Antipattern: Duplicated Data
Converged Architecture
Two Forms
Choose based on user skills
Complete evaluation criteria
Lakehouse on Cloud Storage
Reference architecture
Migration
Future proofing
SQL-First Lakehouse
Reference architecture
Migration
Future proofing
The Benefits of Convergence
Summary
8. Architectures for Streaming
The Value of Streaming
Industry Use Cases
Streaming Use Cases
Streaming Ingest
Streaming ETL
Streaming ELT
Streaming Insert
Streaming from Edge Devices (IoT)
Streaming Sinks
Real-Time Dashboards
Live Querying
Materialize Some Views
Stream Analytics
Time-Series Analytics
Clickstream Analytics
Anomaly Detection
Resilient Streaming
Continuous Intelligence Through ML
Training Model on Streaming Data
Windowed training
Scheduled training
Continuous evaluation and retraining
Streaming ML Inference
Automated Actions
Summary
9. Extending a Data Platform Using Hybrid and Edge
Why Multicloud?
A Single Cloud Is Simpler and Cost-Effective
Multicloud Is Inevitable
Multicloud Could Be Strategic
Multicloud Architectural Patterns
Single Pane of Glass
Write Once, Run Anywhere
Bursting from On Premises to Cloud
Pass-Through from On Premises to Cloud
Data Integration Through Streaming
Adopting Multicloud
Framework
Time Scale
Define a Target Multicloud Architecture
Why Edge Computing?
Bandwidth, Latency, and Patchy Connectivity
Use Cases
Benefits
Challenges
Edge Computing Architectural Patterns
Smart Devices
Smart Gateways
ML Activation
Adopting Edge Computing
The Initial Context
The Project
Improve overall system observability
Develop automations
Optimize the maintenance
The Final Outcomes and Next Steps
Summary
10. AI Application Architecture
Is This an AI/ML Problem?
Subfields of AI
Generative AI
How it works
Strengths and limitations
Do LLMs memorize or generalize?
LLMs hallucinate
Human feedback is needed
Weaknesses
Use cases
Problems Fit for ML
Buy, Adapt, or Build?
Data Considerations
When to Buy
What Can You Buy?
How Adapting Works
AI Architectures
Understanding Unstructured Data
Generating Unstructured Data
Predicting Outcomes
Forecasting Values
Anomaly Detection
Personalization
Automation
Responsible AI
AI Principles
ML Fairness
Explainability
Summary
11. Architecting an ML Platform
ML Activities
Developing ML Models
Labeling Environment
Development Environment
User Environment
Preparing Data
Training ML Models
Writing ML code
Small-scale jobs
Distributed training
No-code ML
Deploying ML Models
Deploying to an Endpoint
Evaluate Model
Hybrid and Multicloud
Training-Serving Skew
Within the model
Transform function
Feature store
The canonical use of a feature store
Decision chart
Automation
Automate Training and Deployment
Orchestration with Pipelines
Managed pipelines
Airflow
Kubeflow Pipelines
TensorFlow Extended
Continuous Evaluation and Training
Artifacts
Dependency tracking
Continuous evaluation
Continuous retraining
Choosing the ML Framework
Team Skills
Task Considerations
User-Centric
Summary
12. Data Platform Modernization: A Model Case
New Technology for a New Era
The Need for Change
It Is Not Only a Matter of Technology
The Beginning of the Journey
The Current Environment
The Target Environment
The PoC Use Case
The RFP Responses Proposed by Cloud Vendors
The Target Environment
The Approach on Migration
Foundations development
Quick wins migration
Migration fulfillment
Modernization
The RFP Evaluation Process
The Scope of the PoC
The Execution of the PoC
The Final Decision
Peroration
Summary
Index