This book equips students with the end-to-end skills needed to do data science: gathering, cleaning, preparing, and sharing data; using statistical models to analyse that data; writing about the results of those models and drawing conclusions from them; and finally, using the cloud to put a model into production, all in a reproducible way.
Many books teach data science, but most assume that you already have the data. This book fills that gap by detailing how to gather datasets, then clean and prepare them, before analysing them. Many books also teach statistical modelling, but few teach how to communicate the results of the models and how those results help us learn about the world. Very few data science textbooks cover ethics, and most of those that do have only a token ethics chapter. Finally, reproducibility is not often emphasised in data science books. This book is based around a straightforward workflow conducted in an ethical and reproducible way: gather data, prepare data, analyse data, and communicate the findings. It achieves these goals by working through extensive case studies of gathering and preparing data, and by integrating ethics throughout. It is specifically designed to teach how to write about data and models, so writing is covered explicitly. And finally, GitHub and the open-source statistical language R are used throughout the book.
Key Features:
Extensive code examples.
Ethics integrated throughout.
Reproducibility integrated throughout.
Focus on data gathering, messy data, and cleaning data.
Extensive formative assessment throughout.
Author(s): Rohan Alexander
Series: Data Science Series
Publisher: CRC Press
Year: 2023
Language: English
Pages: 623
Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Table of contents
Preface
Audience and assumed background
Structure and content
Pedagogy and key features
Software information and conventions
About the author
Acknowledgments
I. Foundations
1. Telling stories with data
1.1. On telling stories
1.2. Workflow components
1.3. Telling stories with data
1.4. How do our worlds become data?
1.5. What is data science and how should we use it to learn about the world?
1.6. Exercises
Questions
Tutorial
2. Drinking from a fire hose
2.1. Hello, World!
2.2. Australian elections
2.2.1. Plan
2.2.2. Simulate
2.2.3. Acquire
2.2.4. Explore
2.2.5. Share
2.3. Toronto’s unhoused population
2.3.1. Plan
2.3.2. Simulate
2.3.3. Acquire
2.3.4. Explore
2.3.5. Share
2.4. Neonatal mortality
2.4.1. Plan
2.4.2. Simulate
2.4.3. Acquire
2.4.4. Explore
2.4.5. Share
2.5. Concluding remarks
2.6. Exercises
Scales
Questions
Tutorial
3. Reproducible workflows
3.1. Introduction
3.2. Quarto
3.2.1. Getting started
3.2.2. Top matter
3.2.3. Essential commands
3.2.4. R chunks
3.2.5. Equations
3.2.6. Cross-references
3.3. R Projects and file structure
3.4. Version control
3.4.1. Git
3.4.2. GitHub
3.5. Using R in practice
3.5.1. Dealing with errors
3.5.2. Reproducible examples
3.5.3. Mentality
3.5.4. Code comments and style
3.5.5. Tests
3.6. Efficiency
3.6.1. Sharing a code environment
3.6.2. Code linting and styling
3.6.3. Code review
3.6.4. Code refactoring
3.6.5. Parallel processing
3.7. Concluding remarks
3.8. Exercises
Scales
Questions
Tutorial
Paper
II. Communication
4. Writing research
4.1. Introduction
4.2. Writing
4.3. Asking questions
4.3.1. Data-first
4.3.2. Question-first
4.4. Answering questions
4.5. Components of a paper
4.5.1. Title
4.5.2. Abstract
4.5.3. Introduction
4.5.4. Data
4.5.5. Model
4.5.6. Results
4.5.7. Discussion
4.5.8. Brevity, typos, and grammar
4.5.9. Rules
4.6. Exercises
Scales
Questions
Tutorial
5. Static communication
5.1. Introduction
5.2. Graphs
5.2.1. Bar charts
5.2.2. Scatterplots
5.2.3. Line plots
5.2.4. Histograms
5.2.5. Boxplots
5.3. Tables
5.3.1. Showing part of a dataset
5.3.2. Improving the formatting
5.3.3. Communicating summary statistics
5.3.4. Display regression results
5.4. Maps
5.4.1. Australian polling places
5.4.2. United States military bases
5.4.3. Geocoding
5.5. Concluding remarks
5.6. Exercises
Scales
Questions
Tutorial
Paper
III. Acquisition
6. Farm data
6.1. Introduction
6.2. Measurement
6.2.1. Properties of measurements
6.2.2. Measurement error
6.2.3. Missing data
6.3. Censuses and other government data
6.3.1. Canada
6.3.2. United States
6.4. Sampling essentials
6.4.1. Sampling in Dublin and Reading
6.4.2. Probabilistic sampling
6.4.3. Non-probability samples
6.5. Exercises
Scales
Questions
Tutorial
7. Gather data
7.1. Introduction
7.2. APIs
7.2.1. arXiv, NASA, and Dataverse
7.2.2. Spotify
7.3. Web scraping
7.3.1. Principles
7.3.2. HTML/CSS essentials
7.3.3. Book information
7.3.4. Prime Ministers of the United Kingdom
7.3.5. Iteration
7.4. PDFs
7.4.1. Jane Eyre
7.4.2. Total Fertility Rate in the United States
7.4.3. Optical Character Recognition
7.5. Exercises
Scales
Questions
Tutorial
8. Hunt data
8.1. Introduction
8.2. Field experiments and randomized controlled trials
8.2.1. Randomization
8.2.2. Simulated example: cats or dogs
8.2.3. Treatment and control
8.2.4. Fisher’s tea party
8.2.5. Ethical foundations
8.3. Surveys
8.3.1. Democracy Fund Voter Study Group
8.4. RCT examples
8.4.1. The Oregon Health Insurance Experiment
8.4.2. Civic Honesty Around The Globe
8.5. A/B testing
8.5.1. Upworthy
8.6. Exercises
Scales
Questions
Tutorial
Paper
IV. Preparation
9. Clean and prepare
9.1. Introduction
9.2. Workflow
9.2.1. Save the original, unedited data
9.2.2. Plan
9.2.3. Start small
9.2.4. Write tests and documentation
9.2.5. Iterate, generalize, and update
9.3. Checking and testing
9.3.1. Graphs
9.3.2. Counts
9.3.3. Tests
9.4. Simulated example: running times
9.5. Names
9.5.1. Machine-readable
9.5.2. Human-readable
9.6. 1996 Tanzanian DHS
9.7. 2019 Kenyan census
9.7.1. Gather and clean
9.7.2. Check and test
9.7.3. Tidy-up
9.8. Exercises
Scales
Questions
Tutorial
10. Store and share
10.1. Introduction
10.2. Plan
10.3. Share
10.3.1. GitHub
10.3.2. R packages for data
10.3.3. Depositing data
10.4. Data documentation
10.5. Personally identifying information
10.5.1. Hashing
10.5.2. Simulation
10.5.3. Differential privacy
10.6. Data efficiency
10.6.1. Iteration
10.6.2. Apache Arrow
10.7. Exercises
Scales
Questions
Tutorial
Paper
V. Modeling
11. Exploratory data analysis
11.1. Introduction
11.2. 1975 United States population and income data
11.3. Missing data
11.4. TTC subway delays
11.4.1. Distribution and properties of individual variables
11.4.2. Relationships between variables
11.5. Airbnb listings in London, England
11.5.1. Distribution and properties of individual variables
11.5.2. Relationships between variables
11.6. Concluding remarks
11.7. Exercises
Scales
Questions
Tutorial
12. Linear models
12.1. Introduction
12.2. Simple linear regression
12.2.1. Simulated example: running times
12.3. Multiple linear regression
12.3.1. Simulated example: running times with rain and humidity
12.4. Building models
12.5. Concluding remarks
12.6. Exercises
Scales
Questions
Tutorial
Paper
13. Generalized linear models
13.1. Introduction
13.2. Logistic regression
13.2.1. Simulated example: day or night
13.2.2. Political support in the United States
13.3. Poisson regression
13.3.1. Simulated example: number of As by department
13.3.2. Letters used in Jane Eyre
13.4. Negative binomial regression
13.4.1. Mortality in Alberta, Canada
13.5. Multilevel modeling
13.5.1. Simulated example: political support
13.5.2. Austen, Brontë, Dickens, and Shakespeare
13.6. Concluding remarks
13.7. Exercises
Scales
Questions
Tutorial
Paper
VI. Applications
14. Causality from observational data
14.1. Introduction
14.2. Directed Acyclic Graphs
14.2.1. Confounder
14.2.2. Mediator
14.2.3. Collider
14.3. Two common paradoxes
14.3.1. Simpson’s paradox
14.3.2. Berkson’s paradox
14.4. Difference-in-differences
14.4.1. Simulated example: tennis serve speed
14.4.2. Assumptions
14.4.3. French newspaper prices between 1960 and 1974
14.5. Propensity score matching
14.5.1. Simulated example: free shipping
14.6. Regression discontinuity design
14.6.1. Simulated example: income and grades
14.6.2. Assumptions
14.6.3. Alcohol and crime in California
14.7. Instrumental variables
14.7.1. Simulated example: health status, smoking, and tax rates
14.7.2. Assumptions
14.8. Exercises
Scales
Questions
Tutorial
15. Multilevel regression with post-stratification
15.1. Introduction
15.2. Simulated example: coffee or tea?
15.2.1. Construct a population and biased sample
15.2.2. Model the sample
15.2.3. Post-stratification dataset
15.3. Forecasting the 2020 United States election
15.3.1. Survey data
15.3.2. Post-stratification data
15.3.3. Model the sample
15.3.4. Post-stratify
15.4. Exercises
Scales
Questions
Tutorial
Paper
16. Text as data
16.1. Introduction
16.2. Text cleaning and preparation
16.2.1. Stop words
16.2.2. Case, numbers, and punctuation
16.2.3. Typos and uncommon words
16.2.4. Tuples
16.2.5. Stemming and lemmatizing
16.2.6. Duplication
16.3. Term Frequency-Inverse Document Frequency (TF-IDF)
16.3.1. Distinguishing horoscopes
16.4. Topic models
16.4.1. What is talked about in the Canadian parliament?
16.5. Exercises
Scales
Questions
Tutorial
17. Concluding remarks
17.1. Concluding remarks
17.2. Some outstanding issues
17.3. Next steps
17.4. Exercises
Questions
References
Index