This practical guide provides a collection of techniques and best practices that are generally overlooked in data engineering and data science pedagogy. A common misconception is that great data scientists are experts in the "big themes" of the discipline: machine learning and programming. But most of the time, these tools can only take us so far. In practice, the smaller tools and skills are what really separate a great data scientist from a not-so-great one.
Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and a qualified data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries.
With this book, you will:
• Understand how data science creates value
• Deliver compelling narratives to sell your data science project
• Build a business case using unit economics principles
• Create new features for an ML model using storytelling
• Learn how to decompose KPIs
• Perform growth decompositions to find root causes for changes in a metric
Daniel Vaughan is head of data at Clip, the leading paytech company in Mexico. He's the author of Analytical Skills for AI and Data Science (O'Reilly).
Author(s): Daniel Vaughan
Publisher: O'Reilly Media
Year: 2023
Language: English
Commentary: Publisher's PDF
Pages: 254
City: Sebastopol, CA
Tags: Machine Learning; Data Science; Python; Predictive Models; Data Visualization; Best Practices; Linear Regression; Storytelling; Communication; Production Models; Bootstrapping; A/B Testing; Narrative Model; Metrics; Simulations; Large Language Models; Data Leakage
Cover
Copyright
Table of Contents
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. Data Analytics Techniques
Chapter 1. So What? Creating Value with Data Science
What Is Value?
What: Understanding the Business
So What: The Gist of Value Creation in DS
Now What: Be a Go-Getter
Measuring Value
Key Takeaways
Further Reading
Chapter 2. Metrics Design
Desirable Properties That Metrics Should Have
Measurable
Actionable
Relevance
Timeliness
Metrics Decomposition
Funnel Analytics
Stock-Flow Decompositions
P×Q-Type Decompositions
Example: Another Revenue Decomposition
Example: Marketplaces
Key Takeaways
Further Reading
Chapter 3. Growth Decompositions: Understanding Tailwinds and Headwinds
Why Growth Decompositions?
Additive Decomposition
Example
Interpretation and Use Cases
Multiplicative Decomposition
Example
Interpretation
Mix-Rate Decompositions
Example
Interpretation
Mathematical Derivations
Additive Decomposition
Multiplicative Decomposition
Mix-Rate Decomposition
Key Takeaways
Further Reading
Chapter 4. 2×2 Designs
The Case for Simplification
What’s a 2×2 Design?
Example: Test a Model and a New Feature
Example: Understanding User Behavior
Example: Credit Origination and Acceptance
Example: Prioritizing Your Workflow
Key Takeaways
Further Reading
Chapter 5. Building Business Cases
Some Principles to Construct Business Cases
Example: Proactive Retention Strategy
Fraud Prevention
Purchasing External Datasets
Working on a Data Science Project
Key Takeaways
Further Reading
Chapter 6. What’s in a Lift?
Lifts Defined
Example: Classifier Model
Self-Selection and Survivorship Biases
Other Use Cases for Lifts
Key Takeaways
Further Reading
Chapter 7. Narratives
What’s in a Narrative: Telling a Story with Your Data
Clear and to the Point
Credible
Memorable
Actionable
Building a Narrative
Science as Storytelling
What, So What, and Now What?
The Last Mile
Writing TL;DRs
Tips to Write Memorable TL;DRs
Example: Writing a TL;DR for This Chapter
Delivering Powerful Elevator Pitches
Presenting Your Narrative
Key Takeaways
Further Reading
Chapter 8. Datavis: Choosing the Right Plot to Deliver a Message
Some Useful and Not-So-Used Data Visualizations
Bar Versus Line Plots
Slopegraphs
Waterfall Charts
Scatterplot Smoothers
Plotting Distributions
General Recommendations
Find the Right Datavis for Your Message
Choose Your Colors Wisely
Different Dimensions in a Plot
Aim for a Large Enough Data-Ink Ratio
Customization Versus Semiautomation
Get the Font Size Right from the Beginning
Interactive or Not
Stay Simple
Start by Explaining the Plot
Key Takeaways
Further Reading
Part II. Machine Learning
Chapter 9. Simulation and Bootstrapping
Basics of Simulation
Simulating a Linear Model and Linear Regression
What Are Partial Dependence Plots?
Omitted Variable Bias
Simulating Classification Problems
Latent Variable Models
Comparing Different Algorithms
Bootstrapping
Key Takeaways
Further Reading
Chapter 10. Linear Regression: Going Back to Basics
What’s in a Coefficient?
The Frisch-Waugh-Lovell Theorem
Why Should You Care About FWL?
Confounders
Additional Variables
The Central Role of Variance in ML
Key Takeaways
Further Reading
Chapter 11. Data Leakage
What Is Data Leakage?
Outcome Is Also a Feature
A Function of the Outcome Is Itself a Feature
Bad Controls
Mislabeling of a Timestamp
Multiple Datasets with Sloppy Time Aggregations
Leakage of Other Information
Detecting Data Leakage
Complete Separation
Windowing Methodology
Choosing the Length of the Windows
The Training Stage Mirrors the Scoring Stage
Implementing the Windowing Methodology
I Have Leakage: Now What?
Key Takeaways
Further Reading
Chapter 12. Productionizing Models
What Does “Production Ready” Mean?
Batch Scores (Offline)
Real-Time Model Objects
Data and Model Drift
Essential Steps in Any Production Pipeline
Get and Transform Data
Validate Data
Training and Scoring Stages
Validate Model and Scores
Deploy Model and Scores
Key Takeaways
Further Reading
Chapter 13. Storytelling in Machine Learning
A Holistic View of Storytelling in ML
Ex Ante and Interim Storytelling
Creating Hypotheses
Feature Engineering
Ex Post Storytelling: Opening the Black Box
Interpretability-Performance Trade-Off
Linear Regression: Setting a Benchmark
Feature Importance
Heatmaps
Partial Dependence Plots
Accumulated Local Effects
Key Takeaways
Further Reading
Chapter 14. From Prediction to Decisions
Dissecting Decision Making
Simple Decision Rules by Smart Thresholding
Precision and Recall
Example: Lead Generation
Confusion Matrix Optimization
Key Takeaways
Further Reading
Chapter 15. Incrementality: The Holy Grail of Data Science?
Defining Incrementality
Causal Reasoning to Improve Prediction
Causal Reasoning as a Differentiator
Improved Decision Making
Confounders and Colliders
Selection Bias
Unconfoundedness Assumption
Breaking Selection Bias: Randomization
Matching
Machine Learning and Causal Inference
Open Source Codebases
Double Machine Learning
Key Takeaways
Further Reading
Chapter 16. A/B Tests
What Is an A/B Test?
Decision Criterion
Minimum Detectable Effects
Choosing the Statistical Power, Level, and P
Estimating the Variance of the Outcome
Simulations
Example: Conversion Rates
Setting the MDE
Hypotheses Backlog
Metric
Hypothesis
Ranking
Governance of Experiments
Key Takeaways
Further Reading
Chapter 17. Large Language Models and the Practice of Data Science
The Current State of AI
What Do Data Scientists Do?
Evolving the Data Scientist’s Job Description
Case Study: A/B Testing
Case Study: Data Cleansing
Case Study: Machine Learning
LLMs and This Book
Key Takeaways
Further Reading
Index
About the Author
Colophon