97 Things Every Data Engineer Should Know

Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.

Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.

Topics include:
• The Importance of Data Lineage - Julien Le Dem
• Data Security for Data Engineers - Katharine Jarmul
• The Two Types of Data Engineering and Data Engineers - Jesse Anderson
• Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
• The End of ETL as We Know It - Paul Singman
• Building a Career as a Data Engineer - Vijay Kiran
• Modern Metadata for the Modern Data Stack - Prukalpa Sankar
• Your Data Tests Failed! Now What? - Sam Bail

Author(s): Tobias Macey
Edition: 1
Publisher: O'Reilly Media
Year: 2021

Language: English
Commentary: Vector PDF
Pages: 264
City: Sebastopol, CA
Tags: Cloud Computing;Data Science;Analytics;Multithreading;Privacy;Concurrency;SQL;NoSQL;Apache Spark;Stream Processing;Logging;Microservices;Pipelines;Maintainability;Sensor Data;Best Practices;User Experience;Community;Data Lake;Data Warehouse;Automation;Testing;Quality Assurance;Distributed Processing;Storage Management;Career;Data Collection;Data Processing;Reproducible Research;Data Pipelines;Metadata;A/B Testing;Data Engineering;Latency;Data Validation;Observability;Data Security

Cover
Copyright
Table of Contents
Preface
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. A (Book) Case for Eventual Consistency
Denise Koessler Gosnell, PhD
Chapter 2. A/B and How to Be
Sonia Mehta
Chapter 3. About the Storage Layer
Julien Le Dem
Chapter 4. Analytics as the Secret Glue for Microservice Architectures
Elias Nema
Chapter 5. Automate Your Infrastructure
Christiano Anderson
Chapter 6. Automate Your Pipeline Tests
Tom White
Build an End-to-End Test of the Whole Pipeline
Use a Small Amount of Representative Data
Prefer Textual Data Formats over Binary
Ensure That Tests Can Be Run Locally
Make Tests Deterministic
Make It Easy to Add More Tests
Chapter 7. Be Intentional About the Batching Model in Your Data Pipelines
Raghotham Murthy
Data Time Window Batching Model
Arrival Time Window Batching Model
ATW and DTW Batching in the Same Pipeline
Chapter 8. Beware of Silver-Bullet Syndrome
Thomas Nield
Chapter 9. Building a Career as a Data Engineer
Vijay Kiran
Chapter 10. Business Dashboards for Data Pipelines
Valliappa (Lak) Lakshmanan
Chapter 11. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
Shweta Katre
Chapter 12. Change Data Capture
Raghotham Murthy
Chapter 13. Column Names as Contracts
Emily Riederer
Chapter 14. Consensual, Privacy-Aware Data Collection
Katharine Jarmul
Attach Consent Metadata
Track Data Provenance
Drop or Encrypt Sensitive Fields
Chapter 15. Cultivate Good Working Relationships with Data Consumers
Ido Shlomo
Don’t Let Consumers Solve Engineering Problems
Adapt Your Expectations
Understand Consumers’ Jobs
Chapter 16. Data Engineering != Spark
Jesse Anderson
Batch and Real-Time Systems
Computation Component
Storage Component
NoSQL Databases
Messaging Component
Chapter 17. Data Engineering for Autonomy and Rapid Innovation
Jeff Magnusson
Implement Reusable Patterns in the ETL Framework
Choose a Framework and Tool Set Accessible Within the Organization
Move the Logic to the Edges of the Pipelines
Create and Support Staging Tables
Bake Data-Flow Logic into Tooling and Infrastructure
Chapter 18. Data Engineering from a Data Scientist’s Perspective
Bill Franks
Database Administration, ETL, and Such
Why the Need for Data Engineers?
What’s the Future?
Chapter 19. Data Pipeline Design Patterns for Reusability and Extensibility
Mukul Sood
Chapter 20. Data Quality for Data Engineers
Katharine Jarmul
Chapter 21. Data Security for Data Engineers
Katharine Jarmul
Learn About Security
Monitor, Log, and Test Access
Encrypt Data
Automate Security Tests
Ask for Help
Chapter 22. Data Validation Is More Than Summary Statistics
Emily Riederer
Chapter 23. Data Warehouses Are the Past, Present, and Future
James Densmore
Chapter 24. Defining and Managing Messages in Log-Centric Architectures
Boris Lublinsky
Chapter 25. Demystify the Source and Illuminate the Data Pipeline
Meghan Kwartler
Chapter 26. Develop Communities, Not Just Code
Emily Riederer
Chapter 27. Effective Data Engineering in the Cloud World
Dipti Borkar
Disaggregated Data Stack
Orchestrate, Orchestrate, Orchestrate
Copying Data Creates Problems
S3 Compatibility
SQL and Structured Data Are Still In
Chapter 28. Embrace the Data Lake Architecture
Vinoth Chandar
Common Pitfalls
Data Lakes
Advantages
Implementation
Chapter 29. Embracing Data Silos
Bin Fan and Amelia Wong
Why Data Silos Exist
Embracing Data Silos
Chapter 30. Engineering Reproducible Data Science Projects
Dr. Tianhui Michael Li
Chapter 31. Five Best Practices for Stable Data Processing
Christian Lauer
Prevent Errors
Set Fair Processing Times
Use Data-Quality Measurement Jobs
Ensure Transaction Security
Consider Dependency on Other Systems
Conclusion
Chapter 32. Focus on Maintainability and Break Up Those ETL Tasks
Chris Moradi
Chapter 33. Friends Don’t Let Friends Do Dual-Writes
Gunnar Morling
Chapter 34. Fundamental Knowledge
Pedro Marcelino
Chapter 35. Getting the “Structured” Back into SQL
Elias Nema
Chapter 36. Give Data Products a Frontend with Latent Documentation
Emily Riederer
Chapter 37. How Data Pipelines Evolve
Chris Heinzmann
Chapter 38. How to Build Your Data Platform like a Product
Barr Moses and Atul Gupte
Align Your Product’s Goals with the Goals of the Business
Gain Feedback and Buy-in from the Right Stakeholders
Prioritize Long-Term Growth and Sustainability over Short-Term Gains
Sign Off on Baseline Metrics for Your Data and How You Measure It
Chapter 39. How to Prevent a Data Mutiny
Sean Knapp
Chapter 40. Know the Value per Byte of Your Data
Dhruba Borthakur
Chapter 41. Know Your Latencies
Dhruba Borthakur
Chapter 42. Learn to Use a NoSQL Database, but Not like an RDBMS
Kirk Kirkconnell
Chapter 43. Let the Robots Enforce the Rules
Anthony Burdi
Chapter 44. Listen to Your Users—but Not Too Much
Amanda Tomlinson
Chapter 45. Low-Cost Sensors and the Quality of Data
Dr. Shivanand Prabhoolall Guness
Chapter 46. Maintain Your Mechanical Sympathy
Tobias Macey
Chapter 47. Metadata ≥ Data
Jonathan Seidman
Chapter 48. Metadata Services as a Core Component of the Data Platform
Lohit VijayaRenu
Discoverability
Security Control
Schema Management
Application Interface and Service Guarantee
Chapter 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
Einat Orr
Chapter 50. Modern Metadata for the Modern Data Stack
Prukalpa Sankar
Data Assets > Tables
Complete Data Visibility, Not Piecemeal Solutions
Built for Metadata That Itself Is Big Data
Embedded Collaboration at Its Heart
Chapter 51. Most Data Problems Are Not Big Data Problems
Thomas Nield
Chapter 52. Moving from Software Engineering to Data Engineering
John Salinas
Chapter 53. Observability for Data Engineers
Barr Moses
How Good Data Turns Bad
Introducing Data Observability
Chapter 54. Perfect Is the Enemy of Good
Bob Haffner
Chapter 55. Pipe Dreams
Scott Haines
Chapter 56. Preventing the Data Lake Abyss
Scott Haines
Establishing Data Contracts
From Generic Data Lake to Data Structure Store
Chapter 57. Prioritizing User Experience in Messaging Systems
Jowanza Joseph
Chapter 58. Privacy Is Your Problem
Stephen Bailey, PhD
Chapter 59. QA and All Its Sexiness
Sonia Mehta
Chapter 60. Seven Things Data Engineers Need to Watch Out for in ML Projects
Dr. Sandeep Uttamchandani
Chapter 61. Six Dimensions for Picking an Analytical Data Warehouse
Gleb Mezhanskiy
Scalability
Price Elasticity
Interoperability
Querying and Transformation Features
Speed
Zero Maintenance
Chapter 62. Small Files in a Big Data World
Adi Polak
What Are Small Files, and Why Are They a Problem?
Why Does It Happen?
Detect and Mitigate
Conclusion
References
Chapter 63. Streaming Is Different from Batch
Dean Wampler, PhD
Chapter 64. Tardy Data
Ariel Shaqed
Chapter 65. Tech Should Take a Back Seat for Data Project Success
Andrew Stevenson
Chapter 66. Ten Must-Ask Questions for Data-Engineering Projects
Haidar Hadi
Question 1: What Are the Touch Points?
Question 2: What Are the Granularities?
Question 3: What Are the Input and Output Schemas?
Question 4: What Is the Algorithm?
Question 5: Do You Need Backfill Data?
Question 6: When Is the Project Due Date?
Question 7: Why Was That Due Date Set?
Question 8: Which Hosting Environment?
Question 9: What Is the SLA?
Question 10: Who Will Be Taking Over This Project?
Chapter 67. The Data Pipeline Is Not About Speed
Rustem Feyzkhanov
Chapter 68. The Dos and Don’ts of Data Engineering
Christopher Bergh
Don’t Be a Hero
Don’t Rely on Hope
Don’t Rely on Caution
Do DataOps
Chapter 69. The End of ETL as We Know It
Paul Singman
Replacing ETL with Intentional Data Transfer
Agreeing on a Data Model Contract
Removing Data Processing Latencies
Taking the First Steps
Chapter 70. The Haiku Approach to Writing Software
Mitch Seymour
Understand the Constraints Up Front
Start Strong Since Early Decisions Can Impact the Final Product
Keep It as Simple as Possible
Engage the Creative Side of Your Brain
Chapter 71. The Hidden Cost of Data Input/Output
Lohit VijayaRenu
Data Compression
Data Format
Data Serialization
Chapter 72. The Holy War Between Proprietary and Open Source Is a Lie
Paige Roberts
Chapter 73. The Implications of the CAP Theorem
Paul Doran
Chapter 74. The Importance of Data Lineage
Julien Le Dem
Chapter 75. The Many Meanings of Missingness
Emily Riederer
Chapter 76. The Six Words That Will Destroy Your Career
Bartosz Mikulski
Chapter 77. The Three Invaluable Benefits of Open Source for Testing Data Quality
Tom Baeyens
Chapter 78. The Three Rs of Data Engineering
Tobias Macey
Reliability
Reproducibility
Repeatability
Conclusion
Chapter 79. The Two Types of Data Engineering and Data Engineers
Jesse Anderson
Types of Data Engineering
Types of Data Engineers
Why These Differences Matter to You
Chapter 80. The Yin and Yang of Big Data Scalability
Paul Brebner
Chapter 81. Threading and Concurrency in Data Processing
Matthew Housley, PhD
Operating System Threading
Threading Overhead
Solving the C10K Problem
Scaling Is Not a Magic Bullet
Further Reading
Chapter 82. Three Important Distributed Programming Concepts
Adi Polak
MapReduce Algorithm
Distributed Shared Memory Model
Message Passing/Actors Model
Conclusions
Chapter 83. Time (Semantics) Won’t Wait
Marta Paes Moreira and Fabian Hueske
Chapter 84. Tools Don’t Matter, Patterns and Practices Do
Bas Geerdink
Chapter 85. Total Opportunity Cost of Ownership
Joe Reis
Chapter 86. Understanding the Ways Different Data Domains Solve Problems
Matthew Seal
Chapter 87. What Is a Data Engineer? Clue: We’re Data Science Enablers
Lewis Gavin
AI and Machine Learning Models Require Data
Clean Data == Better Model
Finally Building a Model
A Model Is Useful Only If Someone Will Use It
So What Am I Getting At?
Chapter 88. What Is a Data Mesh, and How Not to Mesh It Up
Barr Moses and Lior Gavish
Why Use a Data Mesh?
The Final Link: Observability
Chapter 89. What Is Big Data?
Ami Levin
Chapter 90. What to Do When You Don’t Get Any Credit
Jesse Anderson
Chapter 91. When Our Data Science Team Didn’t Produce Value
Joel Nantais
Chapter 92. When to Avoid the Naive Approach
Nimrod Parasol
Chapter 93. When to Be Cautious About Sharing Data
Thomas Nield
Chapter 94. When to Talk and When to Listen
Steven Finkelstein
Chapter 95. Why Data Science Teams Need Generalists, Not Specialists
Eric Colson
Chapter 96. With Great Data Comes Great Responsibility
Lohit VijayaRenu
Put Yourself in the User’s Shoes
Ensure Ethical Use of User Information
Watch Your Data Footprint
Chapter 97. Your Data Tests Failed! Now What?
Sam Bail, PhD
System Response
Logging and Alerting
Alert Response
Stakeholder Communication
Root Cause Identification
Issue Resolution
Contributors
Adi Polak
Amanda Tomlinson
Amelia Wong
Ami Levin
Andrew Stevenson
Anthony Burdi
Ariel Shaqed (Scolnicov)
Atul Gupte
Barr Moses
Bartosz Mikulski
Bas Geerdink
Bill Franks
Bin Fan
Bob Haffner
Boris Lublinsky
Chris Moradi
Christian Heinzmann
Christian Lauer
Christiano Anderson
Christopher Bergh
Dean Wampler
Denise Koessler Gosnell, PhD
Dipti Borkar
Dhruba Borthakur
Einat Orr
Elias Nema
Emily Riederer
Eric Colson
Fabian Hueske
Gleb Mezhanskiy
Gunnar Morling
Haidar Hadi
Ido Shlomo
James Densmore
Jeff Magnusson
Jesse Anderson
Joe Reis
Joel Nantais
John Salinas
Jonathan Seidman
Jowanza Joseph
Julien Le Dem
Katharine Jarmul
Kirk Kirkconnell
Valliappa (Lak) Lakshmanan
Lewis Gavin
Lior Gavish
Lohit VijayaRenu
Marta Paes Moreira
Matthew Housley, PhD
Matthew Seal
Meghan Kwartler
Dr. Tianhui Michael Li
Mitch Seymour
Mukul Sood
Nimrod Parasol
Paige Roberts
Paul Brebner
Paul Doran
Paul Singman
Pedro Marcelino
Dr. Shivanand Prabhoolall Guness
Prukalpa Sankar
Raghotham Murthy
Rustem Feyzkhanov
Sam Bail
Sandeep Uttamchandani
Scott Haines
Sean Knapp
Shweta Katre
Sonia Mehta
Stephen Bailey, PhD
Steven Finkelstein
Thomas Nield
Tobias Macey
Tom Baeyens
Tom White
Vijay Kiran
Vinoth Chandar
Index