Data Science: The Hard Parts: Techniques for Excelling at Data Science

This practical guide provides a collection of techniques and best practices that are generally overlooked in most data engineering and data science pedagogy. A common misconception is that great data scientists are experts in the "big themes" of the discipline—machine learning and programming. But most of the time, these tools can only take us so far. In practice, the smaller tools and skills really separate a great data scientist from a not-so-great one.

Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and a qualified data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries.

With this book, you will:

  • Understand how data science creates value
  • Deliver compelling narratives to sell your data science project
  • Build a business case using unit economics...

    Author(s): Daniel Vaughan
    Publisher: O'Reilly Media
    Year: 2023
    Language: English
    Pages: 254

    Preface
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    I. Data Analytics Techniques
    1. So What? Creating Value with Data Science
    What Is Value?
    What: Understanding the Business
    So What: The Gist of Value Creation in DS
    Now What: Be a Go-Getter
    Measuring Value
    Key Takeaways
    Further Reading
    2. Metrics Design
    Desirable Properties That Metrics Should Have
    Measurable
    Actionable
    Relevance
    Timeliness
    Metrics Decomposition
    Funnel Analytics
    Stock-Flow Decompositions
    P×Q-Type Decompositions
    Example: Another Revenue Decomposition
    Example: Marketplaces
    Key Takeaways
    Further Reading
    3. Growth Decompositions: Understanding Tailwinds and Headwinds
    Why Growth Decompositions?
    Additive Decomposition
    Example
    Interpretation and Use Cases
    Multiplicative Decomposition
    Example
    Interpretation
    Mix-Rate Decompositions
    Example
    Interpretation
    Mathematical Derivations
    Additive Decomposition
    Multiplicative Decomposition
    Mix-Rate Decomposition
    Key Takeaways
    Further Reading
    4. 2×2 Designs
    The Case for Simplification
    What’s a 2×2 Design?
    Example: Test a Model and a New Feature
    Example: Understanding User Behavior
    Example: Credit Origination and Acceptance
    Example: Prioritizing Your Workflow
    Key Takeaways
    Further Reading
    5. Building Business Cases
    Some Principles to Construct Business Cases
    Example: Proactive Retention Strategy
    Fraud Prevention
    Purchasing External Datasets
    Working on a Data Science Project
    Key Takeaways
    Further Reading
    6. What’s in a Lift?
    Lifts Defined
    Example: Classifier Model
    Self-Selection and Survivorship Biases
    Other Use Cases for Lifts
    Key Takeaways
    Further Reading
    7. Narratives
    What’s in a Narrative: Telling a Story with Your Data
    Clear and to the Point
    Credible
    Memorable
    Actionable
    Building a Narrative
    Science as Storytelling
    What, So What, and Now What?
    What?
    So what?
    Now what?
    The Last Mile
    Writing TL;DRs
    Tips to Write Memorable TL;DRs
    Example: Writing a TL;DR for This Chapter
    Delivering Powerful Elevator Pitches
    Presenting Your Narrative
    Key Takeaways
    Further Reading
    8. Datavis: Choosing the Right Plot to Deliver a Message
    Some Useful and Not-So-Used Data Visualizations
    Bar Versus Line Plots
    Slopegraphs
    Waterfall Charts
    Scatterplot Smoothers
    Plotting Distributions
    General Recommendations
    Find the Right Datavis for Your Message
    Choose Your Colors Wisely
    Different Dimensions in a Plot
    Aim for a Large Enough Data-Ink Ratio
    Customization Versus Semiautomation
    Get the Font Size Right from the Beginning
    Interactive or Not
    Stay Simple
    Start by Explaining the Plot
    Key Takeaways
    Further Reading
    II. Machine Learning
    9. Simulation and Bootstrapping
    Basics of Simulation
    Simulating a Linear Model and Linear Regression
    What Are Partial Dependence Plots?
    Omitted Variable Bias
    Simulating Classification Problems
    Latent Variable Models
    Comparing Different Algorithms
    Bootstrapping
    Key Takeaways
    Further Reading
    10. Linear Regression: Going Back to Basics
    What’s in a Coefficient?
    The Frisch-Waugh-Lovell Theorem
    Why Should You Care About FWL?
    Confounders
    Additional Variables
    The Central Role of Variance in ML
    Key Takeaways
    Further Reading
    11. Data Leakage
    What Is Data Leakage?
    Outcome Is Also a Feature
    A Function of the Outcome Is Itself a Feature
    Bad Controls
    Mislabeling of a Timestamp
    Multiple Datasets with Sloppy Time Aggregations
    Leakage of Other Information
    Detecting Data Leakage
    Complete Separation
    Windowing Methodology
    Choosing the Length of the Windows
    The Training Stage Mirrors the Scoring Stage
    Implementing the Windowing Methodology
    I Have Leakage: Now What?
    Key Takeaways
    Further Reading
    12. Productionizing Models
    What Does “Production Ready” Mean?
    Batch Scores (Offline)
    Real-Time Model Objects
    Data and Model Drift
    Essential Steps in any Production Pipeline
    Get and Transform Data
    Validate Data
    Training and Scoring Stages
    Validate Model and Scores
    Deploy Model and Scores
    Key Takeaways
    Further Reading
    13. Storytelling in Machine Learning
    A Holistic View of Storytelling in ML
    Ex Ante and Interim Storytelling
    Creating Hypotheses
    Predicting human behavior
    Predicting system behavior
    Predicting downstream metrics
    Feature Engineering
    Ex Post Storytelling: Opening the Black Box
    Interpretability-Performance Trade-Off
    Linear Regression: Setting a Benchmark
    Feature Importance
    Heatmaps
    Partial Dependence Plots
    Accumulated Local Effects
    Key Takeaways
    Further Reading
    14. From Prediction to Decisions
    Dissecting Decision Making
    Simple Decision Rules by Smart Thresholding
    Precision and Recall
    Example: Lead Generation
    Confusion Matrix Optimization
    Key Takeaways
    Further Reading
    15. Incrementality: The Holy Grail of Data Science?
    Defining Incrementality
    Causal Reasoning to Improve Prediction
    Causal Reasoning as a Differentiator
    Improved Decision Making
    Confounders and Colliders
    Selection Bias
    Unconfoundedness Assumption
    Breaking Selection Bias: Randomization
    Matching
    Machine Learning and Causal Inference
    Open Source Codebases
    Double Machine Learning
    Key Takeaways
    Further Reading
    16. A/B Tests
    What Is an A/B Test?
    Decision Criterion
    Minimum Detectable Effects
    Choosing the Statistical Power, Level, and P
    Estimating the Variance of the Outcome
    Simulations
    Example: Conversion Rates
    Setting the MDE
    Hypotheses Backlog
    Metric
    Hypothesis
    Ranking
    Governance of Experiments
    Key Takeaways
    Further Reading
    17. Large Language Models and the Practice of Data Science
    The Current State of AI
    What Do Data Scientists Do?
    Evolving the Data Scientist’s Job Description
    Case Study: A/B Testing
    Case Study: Data Cleansing
    Case Study: Machine Learning
    LLMs and This Book
    Key Takeaways
    Further Reading
    Index