Designing Cloud Data Platforms

Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as-yet-unimagined data services. Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you’ll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You’ll also explore setting up processes to manage cloud-based data, keep it secure, and analyze it with advanced analytic and BI tools.

About the technology
Well-designed pipelines, storage systems, and APIs eliminate the complicated scaling and maintenance required with on-prem data centers. Once you learn the patterns for designing cloud data platforms, you’ll maximize performance no matter which cloud vendor you use.

About the book
In Designing Cloud Data Platforms, Danil Zburivsky and Lynda Partner reveal a six-layer approach that increases flexibility and reduces costs. Discover patterns for ingesting data from a variety of sources, then learn to harness pre-built services provided by cloud vendors.

What's inside
• Best practices for structured and unstructured data sets
• Cloud-ready machine learning tools
• Metadata and real-time analytics
• Defensive architecture, access, and security

About the reader
For data professionals familiar with the basics of cloud computing, and Hadoop or Spark.

About the authors
Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe.
Lynda Partner is the VP of Analytics-as-a-Service at Pythian, and has been on the business side of data for over 20 years.

Author(s): Danil Zburivsky, Lynda Partner
Edition: 1
Publisher: Manning Publications
Year: 2021

Language: English
Commentary: Vector PDF
Pages: 336
City: Shelter Island, NY
Tags: Google Cloud Platform; Amazon Web Services; Microsoft Azure; Cloud Computing; Analytics; SQL; Relational Databases; NoSQL; Data Lake; Data Warehouse; Data Modeling; Access Management; Data Processing; Metadata; Data Ingestion; Data Security

brief contents
contents
preface
acknowledgments
about this book
Who should read this book
How this book is organized: A roadmap
About the code
liveBook discussion forum
about the authors
about the cover illustration
Chapter 1: Introducing the data platform
1.1 The trends behind the change from data warehouses to data platforms
1.2 Data warehouses struggle with data variety, volume, and velocity
1.2.1 Variety
1.2.2 Volume
1.2.3 Velocity
1.2.4 All the V’s at once
1.3 Data lakes to the rescue?
1.4 Along came the cloud
1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms
1.6 Building blocks of a cloud data platform
1.6.1 Ingestion layer
1.6.2 Storage layer
1.6.3 Processing layer
1.6.4 Serving layer
1.7 How the cloud data platform deals with the three V’s
1.7.1 Variety
1.7.2 Volume
1.7.3 Velocity
1.7.4 Two more V’s
1.8 Common use cases
Chapter 2: Why a data platform and not just a data warehouse
2.1 Cloud data platforms and cloud data warehouses: The practical aspects
2.1.1 A closer look at the data sources
2.1.2 An example cloud data warehouse–only architecture
2.1.3 An example cloud data platform architecture
2.2 Ingesting data
2.2.1 Ingesting data directly into Azure Synapse
2.2.2 Ingesting data into an Azure data platform
2.2.3 Managing changes in upstream data sources
2.3 Processing data
2.3.1 Processing data in the warehouse
2.3.2 Processing data in the data platform
2.4 Accessing data
2.5 Cloud cost considerations
2.6 Exercise answers
Chapter 3: Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google
3.1 Cloud data platform layered architecture
3.1.1 Data ingestion layer
3.1.2 Fast and slow storage
3.1.3 Processing layer
3.1.4 Technical metadata layer
3.1.5 The serving layer and data consumers
3.1.6 Orchestration and ETL overlay layers
3.2 The importance of layers in a data platform architecture
3.3 Mapping cloud data platform layers to specific tools
3.3.1 AWS
3.3.2 Google Cloud
3.3.3 Azure
3.4 Open source and commercial alternatives
3.4.1 Batch data ingestion
3.4.2 Streaming data ingestion and real-time analytics
3.4.3 Orchestration layer
3.5 Exercise answers
Chapter 4: Getting data into the platform
4.1 Databases, files, APIs, and streams
4.1.1 Relational databases
4.1.2 Files
4.1.3 SaaS data via API
4.1.4 Streams
4.2 Ingesting data from relational databases
4.2.1 Ingesting data from RDBMSs using a SQL interface
4.2.2 Full-table ingestion
4.2.3 Incremental table ingestion
4.2.4 Change data capture (CDC)
4.2.5 CDC vendors overview
4.2.6 Data type conversion
4.2.7 Ingesting data from NoSQL databases
4.2.8 Capturing important metadata for RDBMS or NoSQL ingestion pipelines
4.3 Ingesting data from files
4.3.1 Tracking ingested files
4.3.2 Capturing file ingestion metadata
4.4 Ingesting data from streams
4.4.1 Differences between batch and streaming ingestion
4.4.2 Capturing streaming pipeline metadata
4.5 Ingesting data from SaaS applications
4.5.1 No standard approach to API design
4.5.2 No standard way to deal with full vs. incremental data exports
4.5.3 Resulting data is typically highly nested JSON
4.6 Network and security considerations for data ingestion into the cloud
4.6.1 Connecting other networks to your cloud data platform
4.7 Exercise answers
Chapter 5: Organizing and processing data
5.1 Processing as a separate layer in the data platform
5.2 Data processing stages
5.3 Organizing your cloud storage
5.3.1 Cloud storage containers and folders
5.4 Common data processing steps
5.4.1 File format conversion
5.4.2 Data deduplication
5.4.3 Data quality checks
5.5 Configurable pipelines
5.6 Exercise answers
Chapter 6: Real-time data processing and analytics
6.1 Real-time ingestion vs. real-time processing
6.2 Use cases for real-time data processing
6.2.1 Retail use case: Real-time ingestion
6.2.2 Online gaming use case: Real-time ingestion and real-time processing
6.2.3 Summary of real-time ingestion vs. real-time processing
6.3 When should you use real-time ingestion and/or real-time processing?
6.4 Organizing data for real-time use
6.4.1 The anatomy of fast storage
6.4.2 How does fast storage scale?
6.4.3 Organizing data in the real-time storage
6.5 Common data transformations in real time
6.5.1 Causes of duplicates in real-time systems
6.5.2 Deduplicating data in real-time systems
6.5.3 Converting message formats in real-time pipelines
6.5.4 Real-time data quality checks
6.5.5 Combining batch and real-time data
6.6 Cloud services for real-time data processing
6.6.1 AWS real-time processing services
6.6.2 Google Cloud real-time processing services
6.6.3 Azure real-time processing services
6.7 Exercise answers
Chapter 7: Metadata layer architecture
7.1 What we mean by metadata
7.1.1 Business metadata
7.1.2 Data platform internal metadata or “pipeline metadata”
7.2 Taking advantage of pipeline metadata
7.3 Metadata model
7.3.1 Metadata domains
7.4 Metadata layer implementation options
7.4.1 Metadata layer as a collection of configuration files
7.4.2 Metadata database
7.4.3 Metadata API
7.5 Overview of existing solutions
7.5.1 Cloud metadata services
7.5.2 Open source metadata layer implementations
7.6 Exercise answers
Chapter 8: Schema management
8.1 Why schema management
8.1.1 Schema changes in a traditional data warehouse architecture
8.1.2 Schema-on-read approach
8.2 Schema-management approaches
8.2.1 Schema as a contract
8.2.2 Schema management in the data platform
8.2.3 Monitoring schema changes
8.3 Schema Registry implementation
8.3.1 Apache Avro schemas
8.3.2 Existing Schema Registry implementations
8.3.3 Schema Registry as part of a Metadata layer
8.4 Schema evolution scenarios
8.4.1 Schema compatibility rules
8.4.2 Schema evolution and data transformation pipelines
8.5 Schema evolution and data warehouses
8.5.1 Schema-management features of cloud data warehouses
8.6 Exercise answers
Chapter 9: Data access and security
9.1 Different types of data consumers
9.2 Cloud data warehouses
9.2.1 AWS Redshift
9.2.2 Azure Synapse
9.2.3 Google BigQuery
9.2.4 Choosing the right data warehouse
9.3 Application data access
9.3.1 Cloud relational databases
9.3.2 Cloud key/value data stores
9.3.3 Full-text search services
9.3.4 In-memory cache
9.4 Machine learning on the data platform
9.4.1 Machine learning model lifecycle on a cloud data platform
9.4.2 ML cloud collaboration tools
9.5 Business intelligence and reporting tools
9.5.1 Traditional BI tools and cloud data platform integration
9.5.2 Using Excel as a BI tool
9.5.3 BI tools that are external to the cloud provider
9.6 Data security
9.6.1 Users, groups, and roles
9.6.2 Credentials and configuration management
9.6.3 Data encryption
9.6.4 Network boundaries
9.7 Exercise answers
Chapter 10: Fueling business value with data platforms
10.1 Why you need a data strategy
10.2 The analytics maturity journey
10.2.1 SEE: Getting insights from data
10.2.2 PREDICT: Using data to predict what to do
10.2.3 DO: Making your analytics actionable
10.2.4 CREATE: Going beyond analytics into products
10.3 The data platform: The engine that powers analytics maturity
10.4 Platform project stoppers
10.4.1 Time does indeed kill
10.4.2 User adoption
10.4.3 User trust and the need for data governance
10.4.4 Operating in a platform silo
10.4.5 The dollar dance
index
A
B
C
D
E
F
G
H
I
K
L
M
N
O
P
R
S
T
U
V
W