Learning Data Science: Programming and Statistics Fundamentals Using Python (Seventh Early Release)

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions, whether it's for companies designing websites, cities deciding how to improve services, or scientists discovering how to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the Data Science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.

Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It's aimed at those who wish to become data scientists or who already work with data scientists, and at data analysts who wish to cross the "technical/nontechnical" divide. If you have a basic knowledge of Python programming, you'll learn how to work with data using industry-standard tools like pandas.

Refine a question of interest to one that can be studied with data
Pursue data collection that may involve text processing, web scraping, etc.
Glean valuable insights about data through data cleaning, exploration, and visualization
Learn how to use modeling to describe the data
Generalize findings beyond the data

Expected Background Knowledge: We expect readers to be proficient in Python and understand how to: use built-in data structures like lists, dictionaries, and sets; import and use functions and classes from other packages; and write functions from scratch. We also use the NumPy Python package without introduction but don’t expect readers to have much prior experience using it. Readers will get more from this book if they also know a bit of probability, calculus, and linear algebra, but we aim to explain mathematical ideas intuitively.

Author(s): Sam Lau, Deborah Nolan, and Joseph Gonzalez
Publisher: O'Reilly Media, Inc.
Year: 2022

Language: English
Commentary: early release, raw and unedited
Pages: 666

Preface
Expected Background Knowledge
Organization of the Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgements
I. The Data Science Lifecycle
1. The Data Science Lifecycle
The Stages of the Lifecycle
Examples of the Lifecycle
Summary
2. Questions and Data Scope
Big Data and New Opportunities
Example: Google Flu Trends
Target Population, Access Frame, Sample
Instruments and Protocols
Measuring Natural Phenomenon
Accuracy
Types of Bias
Types of Variation
Summary
3. Simulation and Data Design
The Urn Model
Sampling Designs
Sampling Distribution of a Statistic
Simulating the Sampling Distribution
The Hypergeometric Distribution
Example: Simulating Election Poll Bias and Variance
The Pennsylvania Urn Model
An Urn Model with Bias
Conducting Larger Polls
Example: Simulating a Randomized Trial for a Vaccine
Scope
The Urn Model for Random Assignment
Example: Measuring Air Quality
Summary
4. Modeling with Summary Statistics
The Constant Model
Minimizing Loss
Mean Absolute Error
Mean Squared Error
Choosing Loss Functions
Summary
5. Case Study: Why is my Bus Always Late?
Question and Scope
Data Wrangling
Exploring Bus Times
Modeling Wait Times
Summary
II. Rectangular Data
6. Working With Dataframes Using pandas
Subsetting
Data Scope and Question
DataFrames and Indices
Slicing
Filtering Rows
Example: How recently has Luna become a popular name?
Aggregating
Basic Group-Aggregate
Grouping on Multiple Columns
Custom Aggregation Functions
Example: Have People Become More Creative With Baby Names?
Pivoting
Joining
Inner Joins
Left, Right, and Outer Joins
Example: Popularity of NYT Name Categories
Transforming
Apply
Example: Popularity of “L” Names
The Price of Apply
How are Dataframes Different from Other Data Representations?
Dataframes and Spreadsheets
Dataframes and Matrices
Dataframes and Relations
Summary
7. Working With Relations Using SQL
Subsetting
SQL Basics: SELECT and FROM
What’s a Relation?
Slicing
Filtering Rows
Example: How recently has Luna become a popular name?
Aggregating
Basic Group-Aggregate using GROUP BY
Grouping on Multiple Columns
Other Aggregation Functions
Joining
Inner Joins
Left and Right Joins
Example: Popularity of NYT Name Categories
Transforming and Common Table Expressions
SQL Functions
Multistep Queries Using a WITH Clause
Example: Popularity of “L” Names
Summary
III. Understanding The Data
8. Wrangling Files
Data Source Examples
Drug Abuse Warning Network (DAWN) Survey
San Francisco Restaurant Food Safety
File Formats
Delimited format
Fixed-width Format
Hierarchical Formats
Loosely Formatted Text
File Encoding
File Size
Working with Large Data Sets
The Shell and Command Line Tools
Table Shape and Granularity
Granularity of Restaurant Inspections and Violations
DAWN Survey Shape and Granularity
Summary
9. Wrangling Dataframes
Example: Wrangling CO2 Measurements from Mauna Loa Observatory
Quality Checks
Addressing Missing Data
Reshaping the Data Table
Quality Checks
Quality based on scope
Quality of measurements and recorded values
Quality across related features
Quality for analysis
Fixing the Data or Not
Missing Values and Records
Imputing Missing Values
Transformations and Timestamps
Transforming Timestamps
Piping for Transformations
Modifying Structure
Example: Wrangling Restaurant Safety Violations
Narrowing the Focus
Aggregating Violations
Extracting Information from Violation Descriptions
Summary
10. Exploratory Data Analysis
Feature Types
Example: Dog Breeds
Transforming Qualitative Features
The Importance of Feature Types
What to Look For in a Distribution
What to Look For in a Relationship
Two Quantitative Features
One Qualitative and One Quantitative Variable
Two Qualitative Features
Comparisons in Multivariate Settings
Guidelines for Exploration
Example: Sale Prices for Houses
Understanding Price
What Next?
Examining other features
Delving Deeper into Relationships
Fixing Location
EDA discoveries
Summary
11. Data Visualization
Choosing Scale to Reveal Structure
Filling the Data Region
Including Zero
Revealing Shape Through Transformations
Banking to Decipher Relationships
Revealing Relationships Through Straightening
Smoothing and Aggregating Data
Smoothing Techniques to Uncover Shape
Smoothing Techniques to Uncover Relationships and Trends
Smoothing Techniques Need Tuning
Reducing Distributions to Quantiles
When Not to Smooth
Facilitating Meaningful Comparisons
Emphasize the Important Difference
Ordering Groups
Avoid Stacking
Selecting a Color Palette
Guidelines for Comparisons in Plots
Incorporating the Data Design
Data Collected over Time
Observational Studies
Unequal Sampling
Geographic Data
Adding Context
Example: 100m Sprint Times
Creating Plots Using plotly
Figure and Trace Objects
Modifying Layout
Plotting Functions
Annotations
Other Tools for Visualization
matplotlib
Grammar of Graphics
Summary
12. Case Study: How Accurate are Air Quality Measurements?
Question, Design, and Scope
Finding Collocated Sensors
Wrangling the List of AQS Sites
Wrangling the List of PurpleAir Sites
Matching AQS and PurpleAir Sensors
Wrangling and Cleaning AQS Sensor Data
Checking Granularity
Removing Unneeded Columns
Checking the Validity of Dates
Checking the Quality of PM2.5 Measurements
Wrangling PurpleAir Sensor Data
Checking the Granularity
Handling Missing Values
Exploring PurpleAir and AQS Measurements
Creating a Model to Correct PurpleAir Measurements
Summary
IV. Other Data Sources
13. Working with Text
Examples of Text and Tasks
Convert text into a standard format
Extract a piece of text to create a feature
Transform text into features
Text analysis
String Manipulation
Converting Text to a Standard Format with Python String Methods
String Methods in pandas
Splitting Strings to Extract Pieces of Text
Regular Expressions
Concatenation of Literals
Quantifiers
Alternation and Grouping to Create Features
Reference Tables
Text Analysis
Summary
14. Data Exchange
NetCDF Data
Example: Rainfall Around the World
JSON Data
Example: Air Quality Data Exchange
HTTP
REST
Example: Retrieving Info on Clash Songs from Spotify
XML, HTML, and XPath
Example: Scraping Race Times from Wikipedia
XPath
Example: Accessing Exchange Rates from the ECB
Summary
V. Linear Modeling
15. Linear Models
Simple Linear Model
Example: A Simple Linear Model for Air Quality
Interpreting Linear Models
Assessing the Fit
Fitting the Simple Linear Model
Multiple Linear Model
Example: A Multiple Linear Model for Air Quality
Fitting the Multiple Linear Model
A Geometric Problem
Example: Where is the Land of Opportunity?
Explaining Upward Mobility using Commute Time
Relating Upward Mobility Using Multiple Variables
Feature Engineering for Numeric Measurements
Feature Engineering for Categorical Measurements
Summary
16. Model Selection
Overfitting
Example: Energy Consumption
Train-Test Split
Cross-Validation
Example: Fitting a Bent Line Model with Cross-validation
Regularization
Example: A Market Analysis
Model Bias and Variance
Summary
17. Theory for Inference and Prediction
Distributions: Population, Empirical, Sampling
Basics of Hypothesis Testing
Example: A Rank-test to Compare Productivity of Wikipedia Contributors
Example: A Test of Proportions for Vaccine Efficacy
Bootstrapping for Inference
Bootstrapping a Test for a Regression Coefficient
Basics of Confidence Intervals
Confidence Intervals for a Coefficient
Basics of Prediction Intervals
Example: Predicting Bus Lateness
Example: Predicting Crab Size
Example: Predicting the Incremental Growth of a Crab
Probability for Inference and Prediction
Formalizing the Theory for Average rank statistics
General Properties of Random Variables
Probability Behind Testing and Intervals
Probability Behind Model Selection
Summary
18. Case Study: How to Weigh a Donkey
Donkey Study Question and Scope
Wrangling and Transforming
Train-Test Split of the Data
Exploring
Modeling a Donkey’s Weight
A Loss Function for Prescribing Anesthetics
Fitting a Simple Linear Model
Fitting a Multiple Linear Model
Bringing Qualitative Features into the Model
Model Assessment
Summary
VI. Classification
19. Classification
Example: Wind Damaged Trees
Modeling and Classification
A Constant Model
Examining the Relationship Between Size and Windthrow
Modeling Proportions (and Probabilities)
A Logistic Model
Log Odds
Using a Logistic Curve
A Loss Function for the Logistic Model
Fitting a Logistic Model
From Probabilities to Classification
The Confusion Matrix
Precision vs Recall
Summary
20. Numerical Optimization
Gradient Descent Basics
Minimizing Huber Loss
Convex and Differentiable Loss Functions
Variants of Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
Newton’s Method
Summary
21. Case Study: Detecting Fake News
Question and Scope
Obtaining and Wrangling the Data
Exploring the Data
Exploring the Publishers
Exploring Publication Date
Exploring Words in Articles
Modeling
A Single-Word Model
Multiple Word Model
Predicting with the tf-idf Transform
Summary
About the Authors