Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.
Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.
# Topics include:
The Importance of Data Lineage - Julien Le Dem
Data Security for Data Engineers - Katharine Jarmul
The Two Types of Data Engineering and Data Engineers - Jesse Anderson
Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
The End of ETL as We Know It - Paul Singman
Building a Career as a Data Engineer - Vijay Kiran
Modern Metadata for the Modern Data Stack - Prukalpa Sankar
Your Data Tests Failed! Now What? - Sam Bail
Author(s): Tobias Macey
Edition: Early Release
Publisher: O'Reilly
Year: 2021
Language: English
Pages: 256
Tags: Data Engineering, Data
1. Three Distributed Programming Concepts to Be Aware of When Choosing an Open Source Framework
Adi Polak
MapReduce Algorithm
Distributed Shared Memory Model
Message Passing/Actors Model
Conclusions
Resources
2. Seven Things Data Engineers Need to Watch Out for in ML Projects
Dr. Sandeep Uttamchandani
3. A (Book) Case for Eventual Consistency: A Short Story of Keeping Inventory at a Bookstore to Explain Strong and Eventual Consistency in Software Architecture
Denise Koessler Gosnell, PhD
4. A/B and How to Be
Sonia Mehta
5. About the Storage Layer
Julien Le Dem
6. Analytics as the Secret Glue for Microservice Architecture
Elias Nema
7. Automate Your Infrastructure
Christiano Anderson
8. Automate Your Pipeline Tests
Tom White
Build an End-to-End Test of the Whole Pipeline at the Start
Use a Small Amount of Representative Data
Prefer Textual Data Formats over Binary for Testing
Ensure That Tests Can Be Run Locally
Make Tests Deterministic
Make It Easy to Add More Tests
9. Be Intentional About the Batching Model in Your Data Pipelines
Raghotham Murthy
Data Time Window Batching Model
Arrival Time Window Batching Model
ATW and DTW Batching in the Same Pipeline
10. Beware of Silver-Bullet Syndrome: Do You Really Want Your Professional Identity to Be a Tool Stack?
Thomas Nield
11. Building a Career as a Data Engineer
Vijay Kiran
12. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
Shweta Katre
13. Change Data Capture
Raghotham Murthy
14. Column Names as Contracts
Emily Riederer
15. Consensual, Privacy-Aware Data Collection—Brought to You by Data Engineers
Katharine Jarmul
Attach Consent Metadata
Track Data Provenance
Drop or Encrypt Sensitive Fields
16. Cultivate Good Working Relationships with Data Consumers
Ido Shlomo
Don’t Let Consumers Solve Engineering Problems
Adapt Your Expectations
Understand Consumers’ Jobs
17. Data Engineering != Spark
Jesse Anderson
Batch and Real-Time Systems
Computation Component
Storage Component
NoSQL Databases
Messaging Component
18. Data Engineering from a Data Scientist’s Perspective
Bill Franks
Database Administration, ETL, and Such
Why the Need for Data Engineers?
What’s the Future?
19. Data Engineering for Autonomy and Rapid Innovation
Jeff Magnusson
Implement Reusable Patterns in the ETL Framework
Choose a Framework and Tool Set Accessible Within the Organization
Move the Logic to the Edges of the Pipelines
Create and Support Staging Tables
Bake Data-Flow Logic into Tooling and Infrastructure
20. Data Observability: The Next Frontier of Data Engineering
Barr Moses
How Good Data Turns Bad
Introducing: Data Observability
21. The Data Pipeline Is Not About Speed
Rustem Fyzhanov
22. Data Pipelines—Design Patterns for Reusability, Extensibility: Evolve Data Pipelines for Quality, Flexibility, Transparency, and Growth
Mukul Sood
23. Data Quality for Data Engineers
Katharine Jarmul
24. Data Security for Data Engineers
Katharine Jarmul
Learn About Security
Monitor, Log, and Test Access
Encrypt Data
Automate Security Tests
Ask for Help
25. Data Validation Is More Than Summary Statistics
Emily Riederer
26. Data Warehouses Are the Past, Present, and Future
James Densmore
27. Defining and Managing Messages in Log-Centric Architectures
Boris Lublinsky
28. Demystify the Source and Illuminate the Data Pipeline
Meghan Kwartler
29. Develop Communities, Not Just Code
Emily Riederer
30. Effective Data Engineering in the Cloud World
Dipti Borkar
Disaggregated Data Stack
Orchestrate, Orchestrate, Orchestrate
Copying Data Creates Problems
S3 Compatibility
SQL and Structured Data Are Still In
31. Embrace the Data Lake Architecture
Vinoth Chandar
Common Pitfalls
Data Lakes
Advantages
Implementation
32. Embracing Data Silos: The Journey Through a Fragmented Data World
Bin Fan and Amelia Wong
Why Data Silos Exist
Embracing Data Silos
33. Engineering Reproducible Data Science Projects
Michael Li
34. Every Data Pipeline Needs a Real-Time Dashboard of Business Data
Valliappa (Lak) Lakshmanan
35. Five Best Practices for Stable Data Processing: What to Keep in Mind When Setting Up Processes Like ETL or ELT
Christian Lauer
Prevent Errors
Set Fair Processing Times
Use Data-Quality Measurement Jobs
Ensure Transaction Security
Consider Dependency on Other Systems
Conclusion
36. Focus on Maintainability and Break Up Those ETL Tasks
Chris Morandi
37. Friends Don’t Let Friends Do Dual-Writes
Gunnar Morling
38. Fundamental Knowledge
Pedro Marcelino
39. Getting the “Structured” Back into SQL
Elias Nema
40. Give Data Products a Frontend with Latent Documentation
Emily Riederer
41. The Hidden Cost of Data Input/Output
Lohit VijayaRenu
Data Compression
Data Format
Data Serialization
42. How to Build Your Data Platform Like a Product: Go from a (Standard Dev. of) Zero to Bona Fide Data Hero
Barr Moses and Atul Gupte
Align Your Product’s Goals with the Goals of the Business
Gain Feedback and Buy-in from the Right Stakeholders
Prioritize Long-Term Growth and Sustainability over Short-Term Gains
Sign Off on Baseline Metrics for Your Data and How You Measure It
43. How to Prevent a Data Mutiny
Sean Knapp
44. Know Your Latencies
Dhruba Borthakur
45. Know the Value per Byte of Your Data
Dhruba Borthakur
46. Learn to Use a NoSQL Database, but Not Like an RDBMS
Kirk Kirkconnell
47. Let the Robots Enforce the Rules
Anthony Burdi
48. Listen to Your Users—but Not Too Much
Amanda Tomlinson
49. Low-Cost Sensors and the Quality of Data
Dr. Shivanand Prabhoolall Guness
50. Maintain Your Mechanical Sympathy
Tobias Macey
51. Metadata Service(s) as a Core Component of the Data Platform
Lohit VijayaRenu
Discoverability
Security Control
Schema Management
Application Interface and Service Guarantee
52. Metadata ≥ Data
Jonathan Seidman
53. Mind the Gap: Your Data Lake Provides No ACID Guarantees
Einat Orr
54. Modern Metadata for the Modern Data Stack
Prukalpa Sankar
Data Assets > Tables
Complete Data Visibility, Not Piecemeal Solutions
Built for Metadata That Itself Is Big Data
Embedded Collaboration at Its Heart
55. Most Data Problems Are Not Big Data Problems
Thomas Nield
56. Moving from Software Engineering to Data Engineering
John Salinas
57. Perfect Is the Enemy of Good
Bob Haffner
58. Pipe Dreams
Scott Haines
59. Preventing the Data Lake Abyss: How to Ensure That Your Data Remains Valid Over the Years
Scott Haines
Establishing Data Contracts
From Generic Data Lake to Data Structure Store
60. Privacy Is Your Problem
Stephen Bailey, PhD
61. QA and All Its Sexiness
Sonia Mehta
62. Scaling ETL: How Data Pipelines Evolve as Your Business Grows
Chris Heinzmann
63. Scaling Is Easy / Scaling Is Hard: The Yin and Yang of Big Data Scalability
Paul Brebner
64. Six Dimensions for Picking an Analytical Data Warehouse
Gleb Mezhanskiy
Scalability
Price Elasticity
Interoperability
Querying and Transformation Features
Speed
Zero Maintenance
65. Small Files in a Big Data World: A Data Engineer’s Biggest Nightmare
Adi Polak
What Are Small Files, and Why Are They a Problem?
Why Does It Happen?
Detect and Mitigate
Conclusion
References
66. Streaming Is Different from Batch
Dean Wampler, PhD
67. Tardy Data
Ariel Shaqed
68. Tech Should Take a Back Seat for Data Project Success
Andrew Stevenson
69. Ten Must-Ask Questions for Data-Engineering Projects
Haidar Hadi
Question 1: What Are the Touch Points?
Question 2: What Are the Granularities?
Question 3: What Are the Input and Output Schemas?
Question 4: What Is the Algorithm?
Question 5: Do You Need Backfill Data?
Question 6: When Is the Project Due Date?
Question 7: Why Was That Due Date Set?
Question 8: Which Hosting Environment?
Question 9: What Is the SLA?
Question 10: Who Will Be Taking Over This Project?
70. The Do’s and Don’ts of Data Engineering
Christopher Bergh
Don’t Be a Hero
Don’t Rely on Hope
Don’t Rely on Caution
Do DataOps
71. The End of ETL as We Know It
Paul Singman
Replacing ETL with Intentional Data Transfer
Agreeing on a Data Model Contract
Removing Data Processing Latencies
Taking the First Steps
72. The Haiku Approach to Writing Software
Mitch Seymour
Understand the Constraints Up Front
Start Strong Since Early Decisions Can Impact the Final Product
Keep It as Simple as Possible
Engage the Creative Side of Your Brain
73. The Holy War Between Proprietary and Open Source Is a Lie
Paige Roberts
74. The Implications of the CAP Theorem
Paul Doran
75. The Importance of Data Lineage
Julien Le Dem
76. The Many Meanings of Missingness
Emily Riederer
77. The Six Words That Will Destroy Your Career
Bartosz Mikulski
78. The Three Invaluable Benefits of Open Source for Testing Data Quality
Tom Baeyens
79. The Three Rs of Data Engineering
Tobias Macey
Reliability
Reproducibility
Repeatability
Conclusion
80. The Two Types of Data Engineering and Data Engineers
Jesse Anderson
Types of Data Engineering
Types of Data Engineers
Why These Differences Matter to You
81. There’s No Such Thing as Data Quality
Emily Riederer
82. Threading and Concurrency: Understanding the 2020 Amazon Kinesis Outage
Matthew Housley, PhD
Operating System Threading
Threading Overhead
Solving the C10K Problem
Scaling Is Not a Magic Bullet
Further Reading
83. Time (Semantics) Won’t Wait
Marta Paes Moreira and Fabian Hueske
84. Tools Don’t Matter, Patterns and Practices Do
Bas Geerdink
85. Total Opportunity Cost of Ownership
Joe Reis
86. Understanding the Ways Different Data Domains Solve Problems
Matthew Seal
87. What Is a Data Engineer? Clue: We’re Data Science Enablers
Lewis Gavin
AI and Machine Learning Models Require Data
Clean Data == Better Model
Finally Building a Model
A Model Is Useful Only If Someone Will Use It
So What Am I Getting At?
88. What Is a Data Mesh—and How Not to Mesh It Up: A Beginner’s Guide to Implementing the Latest Industry Trend
Barr Moses and Lior Gavish
Why Use a Data Mesh?
The Final Link: Observability
89. What Is Big Data?
Ami Levin
90. What to Do When You Don’t Get Any Credit
Jesse Anderson
91. When Our Data Science Team Didn’t Produce Value: An Important Lesson for Leading a Data Team
Joel Nantais
92. When to Avoid the Naive Approach
Nimrod Parasol
93. When to Be Cautious About Sharing Data
Thomas Nield
94. When to Talk and When to Listen
Steven Finkelstein
95. Why Data Science Teams Need Generalists, Not Specialists
Eric Colson
96. With Great Data Comes Great Responsibility
Lohit VijayaRenu
Put Yourself in the User’s Shoes
Ensure Ethical Use of User Information
Watch Your Data Footprint
97. Your Data Tests Failed! Now What?
Sam Bail, PhD
System Response
Logging and Alerting
Alert Response
Stakeholder Communication
Root Cause Identification
Issue Resolution
Index