Cost-Effective Data Pipelines: Balancing Trade-Offs When Developing Pipelines in the Cloud

The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?

With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring. By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products.

This book helps you:
• Reduce cloud spend with lower-cost cloud service offerings and smart design strategies
• Minimize waste without sacrificing performance by rightsizing compute resources
• Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
• Set up development and test environments that minimize cloud service dependencies
• Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
• Improve data quality and pipeline operation through validation and testing

Author(s): Sev Leonard
Edition: 1
Publisher: O'Reilly Media
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 286
City: Sebastopol, CA
Tags: Amazon Web Services; Cloud Computing; Software Engineering; Monitoring; Logging; Unit Testing; Scaling; Data Pipelines; Mock Testing; Synthetic Data

Cover
Copyright
Table of Contents
Preface
Who This Book Is For
What You Will Learn
What This Book Is Not
Running Example
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Designing Compute for Data Pipelines
Understanding Availability of Cloud Compute
Outages
Capacity Limits
Account Limits
Infrastructure
Leveraging Different Purchasing Options in Pipeline Design
On Demand
Spot/Interruptible
Contractual Discounts
Contractual Discounts in the Real World: A Cautionary Tale
Requirements Gathering for Compute Design
Business Requirements
Architectural Requirements
Requirements-Gathering Example: HoD Batch Ingest
Benchmarking
Instance Family Identification
Cluster Sizing
Monitoring
Benchmarking Example
Undersized
Oversized
Right-Sized
Summary
Recommended Readings
Chapter 2. Responding to Changes in Demand by Scaling Compute
Identifying Scaling Opportunities
Variation in Data Pipelines
Scaling Metrics
Pipeline Scaling Example
Designing for Scaling
Implementing Scaling Plans
Scaling Mechanics
Common Autoscaling Pitfalls
Autoscaling Example
Summary
Recommended Readings
Chapter 3. Data Organization in the Cloud
Cloud Storage Costs
Storage at Rest
Egress
Data Access
Cloud Storage Organization
Storage Bucket Strategies
Lifecycle Configurations
File Structure Design
File Formats
Partitioning
Compaction
Summary
Recommended Readings
Chapter 4. Economical Pipeline Fundamentals
Idempotency
Preventing Data Duplication
Tolerating Data Duplication
Checkpointing
Automatic Retries
Retry Considerations
Retry Levels in Data Pipelines
Data Validation
Validating Data Characteristics
Schemas
Summary
Chapter 5. Setting Up Effective Development Environments
Environments
Software Environments
Data Environments
Data Pipeline Environments
Environment Planning
Local Development
Containers
Resource Dependency Reduction
Resource Cleanup
Summary
Chapter 6. Software Development Strategies
Managing Different Coding Environments
Example: A Multimodal Pipeline
Example: How Code Becomes Difficult to Change
Modular Design
Single Responsibility
Dependency Inversion
Modular Design with DataFrames
Configurable Design
Summary
Recommended Readings
Chapter 7. Unit Testing
The Role of Unit Testing in Data Pipelines
Unit Testing Overview
Example: Identifying Unit Testing Needs
Pipeline Areas to Unit-Test
Data Logic
Connections
Observability
Data Modification Processes
Cloud Components
Working with Dependencies
Interfaces
Data
Example: Unit Testing Plan
Identifying Components to Test
Identifying Dependencies
Summary
Chapter 8. Mocks
Considerations for Replacing Dependencies
Placement
Dependency Stability
Complexity Versus Criticality
Mocking Generic Interfaces
Responses
Requests
Connectivity
Mocking Cloud Services
Building Your Own Mocks
Mocking with Moto
Testing with Databases
Test Database Example
Working with Test Databases
Summary
Further Exploration
More Moto Mocks
Mock Placement
Chapter 9. Data for Testing
Working with Live Data
Benefits
Challenges
Working with Synthetic Data
Benefits
Challenges
Is Synthetic Data the Right Approach?
Manual Data Generation
Automated Data Generation
Synthetic Data Libraries
Schema-Driven Generation
Property-Based Testing
Summary
Chapter 10. Logging
Logging Costs
Impact of Scale
Impact of Cloud Storage Elasticity
Reducing Logging Costs
Effective Logging
Summary
Chapter 11. Finding Your Way with Monitoring
Costs of Inadequate Monitoring
Getting Lost in the Woods
Navigation to the Rescue
System Monitoring
Data Volume
Throughput
Consumer Lag
Worker Utilization
Resource Monitoring
Understanding the Bounds
Understanding Reliability Impacts
Pipeline Performance
Pipeline Stage Duration
Profiling
Errors to Watch Out For
Query Monitoring
Minimizing Monitoring Costs
Summary
Recommended Readings
Chapter 12. Essential Takeaways
An Ounce of Prevention Is Worth a Pound of Cure
Rein In Compute Spend
Organize Your Resources
Design for Interruption
Build In Data Quality
Change Is the Only Constant
Design for Change
Monitor for Change
Parting Thoughts
Appendix A. Preparing a Cloud Budget
It’s All About the Details
Historical Data
Estimating for New Projects
Changes That Impact Costs
Creating a Budget
Budget Summary
Changes Between Previous and Next Budget Periods
Cost Breakdown
Communicating the Budget
Summary
Index
About the Author
Colophon