Data lakes and warehouses have become increasingly fragile, costly, and difficult to maintain as data grows larger and moves faster. Data meshes can help your organization decentralize data, giving ownership back to the engineers who produce it. This book provides a concise yet comprehensive overview of data mesh patterns for streaming and real-time data services.
Authors Hubert Dulay and Stephen Mooney examine the vast differences between streaming and batch data meshes. Data engineers, architects, data product owners, and those in DevOps and MLOps roles will learn the steps for implementing a streaming data mesh, from defining a data domain to building a good data product. Over the course of the book, you'll create a complete self-service data platform and devise a data governance system that enables your mesh to work seamlessly.
With this book, you will:
- Design a streaming data mesh using Kafka
- Learn how to identify a domain
- Build your first data product using self-service tools
- Apply data governance to the data products you create
- Learn the differences between synchronous and asynchronous data services
- Implement self-services that support decentralized data
Author(s): Hubert Dulay, Stephen Mooney
Edition: 1
Publisher: O'Reilly Media
Year: 2023
Language: English
Commentary: Publisher PDF | Published: June 2023 | Revision History: 2023-05-11: First Release
Pages: 223
City: Sebastopol, CA
Tags: Data Mesh; Apache Kafka; Decentralized Data; Distributed Data; Stream-Processing; Data Services; Real-Time
Cover
Copyright
Table of Contents
Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Hubert
Stephen
Chapter 1. Data Mesh Introduction
Data Divide
Data Mesh Pillars
Data Ownership
Data as a Product
Federated Computational Data Governance
Self-Service Data Platform
Data Mesh Diagram
Other Similar Architectural Patterns
Data Fabric
Data Gateways and Data Services
Data Democratization
Data Virtualization
Focusing on Implementation
Apache Kafka
AsyncAPI
Chapter 2. Streaming Data Mesh Introduction
The Streaming Advantage
Streaming Enables Real-Time Use Cases
Streaming Enables Data Optimization Advantages
Reverse ETL
The Kappa Architecture
Lambda Architecture Introduction
Kappa Architecture Introduction
Summary
Chapter 3. Domain Ownership
Identifying Domains
Discernible Domains
Geographic Regions
Hybrid Architecture
Multicloud
Avoiding Ambiguous Domains
Domain-Driven Design
Domain Model
Domain Logic
Bounded Context
The Ubiquitous Language
Data Mesh Domain Roles
Data Product Engineer
Data Product Owner or Data Steward
Streaming Data Mesh Tools and Platforms to Consider
Domain Charge-Backs
Summary
Chapter 4. Streaming Data Products
Defining Data Product Requirements
Identifying Data Product Derivatives
Derivatives from Other Domains
Ingesting Data Product Derivatives with Kafka Connect
Consumability
Synchronous Data Sources
Asynchronous Data Sources and Change Data Capture
Debezium Connectors
Transforming Data Derivatives to Data Products
Data Standardization
Protecting Sensitive Information
SQL
Extract, Transform, and Load
Publishing Data Products with AsyncAPI
Registering the Streaming Data Product
Building an AsyncAPI YAML Document
Assigning Data Tags
Versioning
Monitoring
Summary
Chapter 5. Federated Computational Data Governance
Data Governance in a Streaming Data Mesh
Data Lineage Graph
Streaming Data Catalog to Organize Data Products
Metadata
Schemas
Lineage
Security
Scalability
Generating the Data Product Page from AsyncAPI
Apicurio Registry
Access Workflow
Centralized Versus Decentralized
Centralized Engineers
Decentralized (Domain) Engineers
Summary
Chapter 6. Self-Service Data Infrastructure
Streaming Data Mesh CLI
Resource-Related Commands
Cluster-Related Commands
Topic-Related Commands
The domain Commands
The connect Commands
The streaming Commands
Publishing a Streaming Data Product
Data Governance-Related Services
Security Services
Standards Services
Lineage Services
SaaS Services and APIs
Summary
Chapter 7. Architecting a Streaming Data Mesh
Infrastructure
Two Architecture Solutions
Dedicated Infrastructure
Multitenant Infrastructure
Streaming Data Mesh Central Architecture
The Domain Agent (aka Sidecar)
Data Plane
Control Plane
Summary
Chapter 8. Building a Decentralized Data Team
The Traditional Data Warehouse Structure
Introducing the Decentralized Team Structure
Empowering People
Working Processes
Fostering Collaboration
Data-Driven Automation
New Roles in Data Domains
New Roles in the Data Plane
New Roles in Data Science and Business Intelligence
Chapter 9. Feature Stores
Separating Data Engineering from Data Science
Online and Offline Data Stores
Apache Feast Introduction
Summary
Chapter 10. Streaming Data Mesh in Practice
Streaming Data Mesh Example
Deploying an On-Premises Streaming Data Mesh
Installing a Connector
Deploying Clickstream Connector and Auto-Creating Tables
Deploying the Debezium Postgres CDC Connector
Enrichment of Streaming Data
Publishing the Data Product
Consuming Streaming Data Products
Fully Managed SaaS Services
Summary and Considerations
Index
About the Authors
Colophon