Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python (Final)

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.

Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas.

  • Refine a question of interest to one that can be studied with...
  • Author(s): Sam Lau
  • Publisher: O'Reilly Media
  • Year: 2023
  • Language: English
  • Pages: 594

    Preface
    Expected Background Knowledge
    Organization of the Book
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    I. The Data Science Lifecycle
    1. The Data Science Lifecycle
    The Stages of the Lifecycle
    Examples of the Lifecycle
    Summary
    2. Questions and Data Scope
    Big Data and New Opportunities
    Example: Google Flu Trends
    Target Population, Access Frame, and Sample
    Example: What Makes Members of an Online Community Active?
    Example: Who Will Win the Election?
    Example: How Do Environmental Hazards Relate to an Individual’s Health?
    Instruments and Protocols
    Measuring Natural Phenomena
    Example: What Is the Level of CO2 in the Air?
    Accuracy
    Types of Bias
    Types of Variation
    Summary
    3. Simulation and Data Design
    The Urn Model
    Sampling Designs
    Sampling Distribution of a Statistic
    Simulating the Sampling Distribution
    Simulation with the Hypergeometric Distribution
    Example: Simulating Election Poll Bias and Variance
    The Pennsylvania Urn Model
    An Urn Model with Bias
    Conducting Larger Polls
    Example: Simulating a Randomized Trial for a Vaccine
    Scope
    The Urn Model for Random Assignment
    Example: Measuring Air Quality
    Summary
    4. Modeling with Summary Statistics
    The Constant Model
    Minimizing Loss
    Mean Absolute Error
    Mean Squared Error
    Choosing Loss Functions
    Summary
    5. Case Study: Why Is My Bus Always Late?
    Question and Scope
    Data Wrangling
    Exploring Bus Times
    Modeling Wait Times
    Summary
    II. Rectangular Data
    6. Working with Dataframes Using pandas
    Subsetting
    Data Scope and Question
    Dataframes and Indices
    Slicing
    Filtering Rows
    Example: How Recently Has Luna Become a Popular Name?
    Aggregating
    Basic Group-Aggregate
    Example: Using .value_counts()
    Grouping on Multiple Columns
    Custom Aggregation Functions
    Pivoting
    Joining
    Inner Joins
    Left, Right, and Outer Joins
    Example: Popularity of NYT Name Categories
    Transforming
    Apply
    Example: Popularity of “L” Names
    The Price of Apply
    How Are Dataframes Different from Other Data Representations?
    Dataframes and Spreadsheets
    Dataframes and Matrices
    Dataframes and Relations
    Summary
    7. Working with Relations Using SQL
    Subsetting
    SQL Basics: SELECT and FROM
    What’s a Relation?
    Slicing
    Filtering Rows
    Example: How Recently Has Luna Become a Popular Name?
    Aggregating
    Basic Group-Aggregate Using GROUP BY
    Grouping on Multiple Columns
    Other Aggregation Functions
    Joining
    Inner Joins
    Left and Right Joins
    Example: Popularity of NYT Name Categories
    Transforming and Common Table Expressions
    SQL Functions
    Multistep Queries Using a WITH Clause
    Example: Popularity of “L” Names
    Summary
    III. Understanding the Data
    8. Wrangling Files
    Data Source Examples
    Drug Abuse Warning Network (DAWN) Survey
    San Francisco Restaurant Food Safety
    File Formats
    Delimited Format
    Fixed-Width Format
    Hierarchical Formats
    Loosely Formatted Text
    File Encoding
    File Size
    The Shell and Command-Line Tools
    Table Shape and Granularity
    Granularity of Restaurant Inspections and Violations
    DAWN Survey Shape and Granularity
    Summary
    9. Wrangling Dataframes
    Example: Wrangling CO2 Measurements from the Mauna Loa Observatory
    Quality Checks
    Addressing Missing Data
    Reshaping the Data Table
    Quality Checks
    Quality Based on Scope
    Quality of Measurements and Recorded Values
    Quality Across Related Features
    Quality for Analysis
    Fixing the Data or Not
    Missing Values and Records
    Transformations and Timestamps
    Transforming Timestamps
    Piping for Transformations
    Modifying Structure
    Example: Wrangling Restaurant Safety Violations
    Narrowing the Focus
    Aggregating Violations
    Extracting Information from Violation Descriptions
    Summary
    10. Exploratory Data Analysis
    Feature Types
    Example: Dog Breeds
    Transforming Qualitative Features
    Relabel categories
    Collapse categories
    Convert quantitative to ordinal
    The Importance of Feature Types
    What to Look For in a Distribution
    What to Look For in a Relationship
    Two Quantitative Features
    One Qualitative and One Quantitative Variable
    Two Qualitative Features
    Comparisons in Multivariate Settings
    Guidelines for Exploration
    Example: Sale Prices for Houses
    Understanding Price
    What Next?
    Examining Other Features
    Delving Deeper into Relationships
    Fixing Location
    EDA Discoveries
    Summary
    11. Data Visualization
    Choosing Scale to Reveal Structure
    Filling the Data Region
    Including Zero
    Revealing Shape Through Transformations
    Banking to Decipher Relationships
    Revealing Relationships Through Straightening
    Smoothing and Aggregating Data
    Smoothing Techniques to Uncover Shape
    Smoothing Techniques to Uncover Relationships and Trends
    Smoothing Techniques Need Tuning
    Reducing Distributions to Quantiles
    When Not to Smooth
    Facilitating Meaningful Comparisons
    Emphasize the Important Difference
    Ordering Groups
    Avoid Stacking
    Selecting a Color Palette
    Guidelines for Comparisons in Plots
    Incorporating the Data Design
    Data Collected Over Time
    Observational Studies
    Unequal Sampling
    Geographic Data
    Adding Context
    Example: 100m Sprint Times
    Creating Plots Using plotly
    Figure and Trace Objects
    Modifying Layout
    Plotting Functions
    Annotations
    Other Tools for Visualization
    matplotlib
    Grammar of Graphics
    Summary
    12. Case Study: How Accurate Are Air Quality Measurements?
    Question, Design, and Scope
    Finding Collocated Sensors
    Wrangling the List of AQS Sites
    Wrangling the List of PurpleAir Sites
    Matching AQS and PurpleAir Sensors
    Wrangling and Cleaning AQS Sensor Data
    Checking Granularity
    Removing Unneeded Columns
    Checking the Validity of Dates
    Checking the Quality of PM2.5 Measurements
    Wrangling PurpleAir Sensor Data
    Checking the Granularity
    Visualizing timestamps
    Checking the sampling rate
    Handling Missing Values
    Exploring PurpleAir and AQS Measurements
    Creating a Model to Correct PurpleAir Measurements
    Summary
    IV. Other Data Sources
    13. Working with Text
    Examples of Text and Tasks
    Convert Text into a Standard Format
    Extract a Piece of Text to Create a Feature
    Transform Text into Features
    Text Analysis
    String Manipulation
    Converting Text to a Standard Format with Python String Methods
    String Methods in pandas
    Splitting Strings to Extract Pieces of Text
    Regular Expressions
    Concatenation of Literals
    Character classes
    Wildcard character
    Negated character classes
    Shorthands for character classes
    Anchors and boundaries
    Escaping metacharacters
    Quantifiers
    Alternation and Grouping to Create Features
    Reference Tables
    Text Analysis
    Summary
    14. Data Exchange
    NetCDF Data
    JSON Data
    HTTP
    REST
    XML, HTML, and XPath
    Example: Scraping Race Times from Wikipedia
    XPath
    Example: Accessing Exchange Rates from the ECB
    Summary
    V. Linear Modeling
    15. Linear Models
    Simple Linear Model
    Example: A Simple Linear Model for Air Quality
    Interpreting Linear Models
    Assessing the Fit
    Fitting the Simple Linear Model
    Multiple Linear Model
    Fitting the Multiple Linear Model
    Example: Where Is the Land of Opportunity?
    Explaining Upward Mobility Using Commute Time
    Relating Upward Mobility Using Multiple Variables
    Feature Engineering for Numeric Measurements
    Feature Engineering for Categorical Measurements
    Summary
    16. Model Selection
    Overfitting
    Example: Energy Consumption
    Train-Test Split
    Cross-Validation
    Regularization
    Model Bias and Variance
    Summary
    17. Theory for Inference and Prediction
    Distributions: Population, Empirical, Sampling
    Basics of Hypothesis Testing
    Example: A Rank Test to Compare Productivity of Wikipedia Contributors
    Example: A Test of Proportions for Vaccine Efficacy
    Bootstrapping for Inference
    Basics of Confidence Intervals
    Basics of Prediction Intervals
    Example: Predicting Bus Lateness
    Example: Predicting Crab Size
    Example: Predicting the Incremental Growth of a Crab
    Probability for Inference and Prediction
    Formalizing the Theory for Average Rank Statistics
    General Properties of Random Variables
    Probability Behind Testing and Intervals
    Probability Behind Model Selection
    Summary
    18. Case Study: How to Weigh a Donkey
    Donkey Study Question and Scope
    Wrangling and Transforming
    Exploring
    Modeling a Donkey’s Weight
    A Loss Function for Prescribing Anesthetics
    Fitting a Simple Linear Model
    Fitting a Multiple Linear Model
    Bringing Qualitative Features into the Model
    Model Assessment
    Summary
    VI. Classification
    19. Classification
    Example: Wind-Damaged Trees
    Modeling and Classification
    A Constant Model
    Examining the Relationship Between Size and Windthrow
    Modeling Proportions (and Probabilities)
    A Logistic Model
    Log Odds
    Using a Logistic Curve
    A Loss Function for the Logistic Model
    From Probabilities to Classification
    The Confusion Matrix
    Precision Versus Recall
    Summary
    20. Numerical Optimization
    Gradient Descent Basics
    Minimizing Huber Loss
    Convex and Differentiable Loss Functions
    Variants of Gradient Descent
    Stochastic Gradient Descent
    Mini-Batch Gradient Descent
    Newton’s Method
    Summary
    21. Case Study: Detecting Fake News
    Question and Scope
    Obtaining and Wrangling the Data
    Exploring the Data
    Exploring the Publishers
    Exploring Publication Date
    Exploring Words in Articles
    Modeling
    A Single-Word Model
    Multiple-Word Model
    Predicting with the tf-idf Transform
    Summary
    Additional Material
    Data Sources
    Index