This practical guide provides a collection of techniques and best practices that most data engineering and data science pedagogy overlooks. A common misconception is that great data scientists are experts in the "big themes" of the discipline: machine learning and programming. But most of the time, these tools can only take us so far. In practice, it is the smaller tools and skills that separate a great data scientist from a not-so-great one.
Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and a qualified data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries.
With this book, you will:
Understand how data science creates value
Deliver compelling narratives to sell your data science project
Build a business case using unit economics...
Author(s): Daniel Vaughan
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 254
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
I. Data Analytics Techniques
1. So What? Creating Value with Data Science
What Is Value?
What: Understanding the Business
So What: The Gist of Value Creation in DS
Now What: Be a Go-Getter
Measuring Value
Key Takeaways
Further Reading
2. Metrics Design
Desirable Properties That Metrics Should Have
Measurable
Actionable
Relevance
Timeliness
Metrics Decomposition
Funnel Analytics
Stock-Flow Decompositions
P×Q-Type Decompositions
Example: Another Revenue Decomposition
Example: Marketplaces
Key Takeaways
Further Reading
3. Growth Decompositions: Understanding Tailwinds and Headwinds
Why Growth Decompositions?
Additive Decomposition
Example
Interpretation and Use Cases
Multiplicative Decomposition
Example
Interpretation
Mix-Rate Decompositions
Example
Interpretation
Mathematical Derivations
Additive Decomposition
Multiplicative Decomposition
Mix-Rate Decomposition
Key Takeaways
Further Reading
4. 2×2 Designs
The Case for Simplification
What’s a 2×2 Design?
Example: Test a Model and a New Feature
Example: Understanding User Behavior
Example: Credit Origination and Acceptance
Example: Prioritizing Your Workflow
Key Takeaways
Further Reading
5. Building Business Cases
Some Principles to Construct Business Cases
Example: Proactive Retention Strategy
Fraud Prevention
Purchasing External Datasets
Working on a Data Science Project
Key Takeaways
Further Reading
6. What’s in a Lift?
Lifts Defined
Example: Classifier Model
Self-Selection and Survivorship Biases
Other Use Cases for Lifts
Key Takeaways
Further Reading
7. Narratives
What’s in a Narrative: Telling a Story with Your Data
Clear and to the Point
Credible
Memorable
Actionable
Building a Narrative
Science as Storytelling
What, So What, and Now What?
What?
So what?
Now what?
The Last Mile
Writing TL;DRs
Tips to Write Memorable TL;DRs
Example: Writing a TL;DR for This Chapter
Delivering Powerful Elevator Pitches
Presenting Your Narrative
Key Takeaways
Further Reading
8. Datavis: Choosing the Right Plot to Deliver a Message
Some Useful and Not-So-Used Data Visualizations
Bar Versus Line Plots
Slopegraphs
Waterfall Charts
Scatterplot Smoothers
Plotting Distributions
General Recommendations
Find the Right Datavis for Your Message
Choose Your Colors Wisely
Different Dimensions in a Plot
Aim for a Large Enough Data-Ink Ratio
Customization Versus Semiautomation
Get the Font Size Right from the Beginning
Interactive or Not
Stay Simple
Start by Explaining the Plot
Key Takeaways
Further Reading
II. Machine Learning
9. Simulation and Bootstrapping
Basics of Simulation
Simulating a Linear Model and Linear Regression
What Are Partial Dependence Plots?
Omitted Variable Bias
Simulating Classification Problems
Latent Variable Models
Comparing Different Algorithms
Bootstrapping
Key Takeaways
Further Reading
10. Linear Regression: Going Back to Basics
What’s in a Coefficient?
The Frisch-Waugh-Lovell Theorem
Why Should You Care About FWL?
Confounders
Additional Variables
The Central Role of Variance in ML
Key Takeaways
Further Reading
11. Data Leakage
What Is Data Leakage?
Outcome Is Also a Feature
A Function of the Outcome Is Itself a Feature
Bad Controls
Mislabeling of a Timestamp
Multiple Datasets with Sloppy Time Aggregations
Leakage of Other Information
Detecting Data Leakage
Complete Separation
Windowing Methodology
Choosing the Length of the Windows
The Training Stage Mirrors the Scoring Stage
Implementing the Windowing Methodology
I Have Leakage: Now What?
Key Takeaways
Further Reading
12. Productionizing Models
What Does “Production Ready” Mean?
Batch Scores (Offline)
Real-Time Model Objects
Data and Model Drift
Essential Steps in any Production Pipeline
Get and Transform Data
Validate Data
Training and Scoring Stages
Validate Model and Scores
Deploy Model and Scores
Key Takeaways
Further Reading
13. Storytelling in Machine Learning
A Holistic View of Storytelling in ML
Ex Ante and Interim Storytelling
Creating Hypotheses
Predicting human behavior
Predicting system behavior
Predicting downstream metrics
Feature Engineering
Ex Post Storytelling: Opening the Black Box
Interpretability-Performance Trade-Off
Linear Regression: Setting a Benchmark
Feature Importance
Heatmaps
Partial Dependence Plots
Accumulated Local Effects
Key Takeaways
Further Reading
14. From Prediction to Decisions
Dissecting Decision Making
Simple Decision Rules by Smart Thresholding
Precision and Recall
Example: Lead Generation
Confusion Matrix Optimization
Key Takeaways
Further Reading
15. Incrementality: The Holy Grail of Data Science?
Defining Incrementality
Causal Reasoning to Improve Prediction
Causal Reasoning as a Differentiator
Improved Decision Making
Confounders and Colliders
Selection Bias
Unconfoundedness Assumption
Breaking Selection Bias: Randomization
Matching
Machine Learning and Causal Inference
Open Source Codebases
Double Machine Learning
Key Takeaways
Further Reading
16. A/B Tests
What Is an A/B Test?
Decision Criterion
Minimum Detectable Effects
Choosing the Statistical Power, Level, and P
Estimating the Variance of the Outcome
Simulations
Example: Conversion Rates
Setting the MDE
Hypotheses Backlog
Metric
Hypothesis
Ranking
Governance of Experiments
Key Takeaways
Further Reading
17. Large Language Models and the Practice of Data Science
The Current State of AI
What Do Data Scientists Do?
Evolving the Data Scientist’s Job Description
Case Study: A/B Testing
Case Study: Data Cleansing
Case Study: Machine Learning
LLMs and This Book
Key Takeaways
Further Reading
Index