Exploring Data Science with R and the Tidyverse: A Concise Introduction

This book introduces the reader to data science using R and the tidyverse. No prior knowledge of college-level programming or mathematics (e.g., calculus or statistics) is needed. The book is self-contained, so readers can begin building data science workflows immediately without consulting extensive external resources for onboarding. The contents are targeted at undergraduate students but are equally applicable to students at the graduate level and beyond. The book develops concepts using many real-world examples to motivate the reader.

Upon completion of the text, the reader will be able to:
- Gain proficiency in R programming
- Load and manipulate data frames, and "tidy" them using tidyverse tools
- Conduct statistical analyses and draw meaningful inferences from them
- Perform modeling from numerical and textual data
- Generate data visualizations (numerical and spatial) using ggplot2 and understand what is being represented

An accompanying R package, "edsdata", contains the synthetic and real datasets used by the textbook and is meant to be used for further practice. An exercise set is also available and is designed for compatibility with automated grading tools for instructor use.

As you develop familiarity with processing data, you learn how to build intuition about the data at hand by glancing at its values. Unfortunately, there is only so much you can learn by glancing at values, and the limitation becomes substantial when the data is large. Visualization is a powerful tool in such cases. In this chapter we introduce another key member of the tidyverse, the ggplot2 package, for visualization. R provides many facilities for creating visualizations; the most sophisticated of them, and perhaps the most elegant, is ggplot2. In this section we introduce how to generate visualizations using ggplot2.

Author(s): Jerry Bonnell, Mitsunori Ogihara
Publisher: CRC Press
Year: 2023

Language: English

Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
Welcome
1. Data Types
1.1. Integers and Doubles
1.1.1. A primer in computer architecture
1.1.2. Specifying integers and doubles in R
1.1.3. Finding the size limit
1.2. Strings
1.2.1. Prerequisite
1.2.2. Strings in R
1.2.3. Conversions to and from numbers
1.2.4. Length of strings
1.2.5. Substrings
1.2.6. String concatenation
1.3. Logical Data
1.3.1. Comparisons
1.3.2. Boolean operations
1.3.3. Comparing strings
1.4. Vectors
1.4.1. Sequences
1.4.2. The combine function
1.4.3. Element-wise operations
1.4.4. Booleans and element-wise operations
1.4.5. Functions on vectors
1.5. Lists
1.5.1. Working with lists
1.5.2. Visualizing lists
1.6. stringr Operations
1.6.1. Prerequisites
1.6.2. Regular expressions
1.6.3. Detect matches
1.6.4. Subset strings
1.6.5. Manage lengths
1.6.6. Mutate strings
1.6.7. Join and split
1.6.8. Sorting
1.6.9. An example: stringr and lists
1.7. Exercises
2. Data Transformation
2.1. Datasets and Tidy Data
2.1.1. Prerequisites
2.1.2. A “hello world!” dataset
2.1.3. In pursuit of tidy data
2.1.4. Example: is it tidy?
2.2. Working with Datasets
2.2.1. Prerequisites
2.2.2. The data frame
2.2.3. Tibbles
2.2.4. Accessing columns and rows
2.2.5. Extracting basic information from a tibble
2.2.6. Creating tibbles
2.2.7. Loading data from an external source
2.2.8. Writing results to a file
2.3. dplyr Verbs
2.3.1. Prerequisites
2.3.2. A fast overview of the verbs
2.3.3. Selecting columns with select
2.3.4. Filtering rows with filter
2.3.5. Re-arranging rows with arrange
2.3.6. Selecting rows with slice
2.3.7. Renaming columns with rename
2.3.8. Relocating column positions with relocate
2.3.9. Adding new columns using mutate
2.3.10. The function transmute
2.3.11. The pair group_by and summarize
2.3.12. Coordinating multiple actions using |>
2.3.13. Practice makes perfect!
2.4. Tidy Transformations
2.4.1. Prerequisites
2.4.2. Uniting and separating columns
2.4.3. Pulling data from multiple sources
2.4.4. Pivoting
2.5. Applying Functions to Columns
2.5.1. Prerequisites
2.5.2. What is a function anyway?
2.5.3. A very simple function
2.5.4. Functions that compute a value
2.5.5. Functions that take arguments
2.5.6. Applying functions using mutate
2.5.7. purrr maps
2.5.8. purrr with mutate
2.6. Handling Missing Values
2.6.1. Prerequisites
2.6.2. A dataset with missing values
2.6.3. Eliminating rows with missing values
2.6.4. Filling values by looking at neighbors
2.6.5. Filling values according to a global constant
2.7. Exercises
3. Data Visualization
3.1. Introduction to ggplot2
3.1.1. Prerequisites
3.1.2. The layered grammar of graphics
3.2. Point Geoms
3.2.1. Prerequisites
3.2.2. The mpg tibble
3.2.3. Your first visualization
3.2.4. Scatter plots
3.2.5. Adding color to your geoms
3.2.6. Mapping versus setting
3.2.7. Categorical variables
3.2.8. Continuous variables
3.2.9. Other articulations
3.2.10. Jittering
3.2.11. One more scatter plot
3.3. Line and Smooth Geoms
3.3.1. Prerequisites
3.3.2. A toy data frame
3.3.3. The line geom
3.3.4. Combining ggplot calls with dplyr
3.3.5. Smoothers
3.3.6. Observing a negative trend
3.3.7. Working with multiple geoms
3.4. Categorical Variables
3.4.1. Prerequisites
3.4.2. The happy and diamonds data frames
3.4.3. Bar charts
3.4.4. Dealing with categorical variables
3.4.5. More on positional adjustments
3.4.6. Coordinate systems
3.5. Numerical Variables
3.5.1. Prerequisites
3.5.2. A slice of mpg
3.5.3. What is a histogram?
3.6. Histogram Shapes and Sizes
3.6.1. The horizontal axis and bar width
3.6.2. The counts in the bins
3.6.3. Density scale
3.6.4. Why bother with density scale?
3.6.5. Density scale makes direct comparisons possible
3.6.6. Histograms and positional adjustments
3.7. Drawing Maps
3.7.1. Prerequisites
3.7.2. Simple maps with polygon geoms
3.7.3. Shape data and simple feature geoms
3.7.4. Choropleth maps
3.7.5. Interactive maps with mapview
3.8. Exercises
4. Building Simulations
4.1. The sample Function
4.2. if Conditionals
4.2.1. Remember logical data types?
4.2.2. The if statement
4.2.3. The if statement: a general description
4.2.4. One more example: comparing strings
4.3. for Loops
4.3.1. Prerequisites
4.3.2. Feeling lucky
4.3.3. A multi-round betting game
4.3.4. Recording outcomes
4.3.5. Example: 1,000 tosses
4.4. A Recipe for Simulation
4.4.1. Prerequisites
4.4.2. Step 1: Determine what to simulate
4.4.3. Step 2: Figure out how to simulate one value
4.4.4. Step 3: Run the simulation and visualize!
4.4.5. Putting the steps together in R
4.4.6. Difference in the number of heads and tails in 100 coin tosses
4.4.7. Step 1: Determine what to simulate
4.4.8. Step 2: Figure out how to simulate one value
4.4.9. Step 3: Run and visualize!
4.4.10. Not-so-doubling rewards
4.4.11. Accumulation
4.4.12. Lasting effects of errors
4.4.13. Simulation round-up
4.5. The Birthday Paradox
4.5.1. Prerequisites
4.5.2. A quick theoretical exploration
4.5.3. Simulating the paradox
4.5.4. Step 1: Determine what to simulate
4.5.5. Step 2: Figure out how to simulate one value
4.5.6. Step 3: Run and visualize!
4.6. Exercises
5. Sampling
5.1. To Sample or Not to Sample?
5.1.1. Prerequisites
5.1.2. The existential questions: Shakespeare ponders sampling
5.1.3. Deterministic samples
5.1.4. Random sampling
5.1.5. To sample systematically or not?
5.1.6. To sample with replacement or not?
5.1.7. To select samples uniformly or not?
5.2. Distribution of a Sample
5.2.1. Prerequisites
5.2.2. Throwing a 6-sided die
5.2.3. Pop quiz: why sample with replacement?
5.3. Populations
5.3.1. Prerequisites
5.3.2. Sampling distribution of departure delays
5.3.3. Summary: Histogram of the sample
5.4. The Mean and Median
5.4.1. Prerequisites
5.4.2. Properties of the mean
5.4.3. The mean: a measure of central tendency
5.4.4. Symmetric distributions
5.4.5. The mean of two identical distributions is identical
5.5. Simulating a Statistic
5.5.1. Prerequisites
5.5.2. The variability of statistics
5.5.3. Simulating a statistic
5.5.4. Guessing a “lucky” number
5.5.5. Step 1: Select the statistic to estimate
5.5.6. Step 2: Write code for estimation
5.5.7. Step 3: Generate estimations and visualize
5.5.8. Median flight delay in flights
5.5.9. Step 1: Select the statistic to estimate
5.5.10. Step 2: Write code for estimation
5.5.11. Step 3: Generate estimations and visualize
5.6. Convenience Sampling
5.6.1. Prerequisites
5.6.2. Systematic selection
5.6.3. Beware: the presence of patterns
5.7. Exercises
6. Hypothesis Testing
6.1. Testing a Model
6.1.1. Prerequisites
6.1.2. The rmultinom function
6.1.3. A model for 10,000 coin tosses
6.1.4. Chance of the observed value of the test statistic occurring
6.2. Case Study: Entering Harvard
6.2.1. Prerequisites
6.2.2. Students for Fair Admissions
6.2.3. Proportion of Asian American students
6.3. Significance Levels
6.3.1. Prerequisites
6.3.2. A midterm grumble?
6.3.3. Cut-off points
6.3.4. The significance level is an error probability
6.3.5. The verdict: is the TA guilty?
6.3.6. Choosing a test statistic
6.4. Permutation Testing
6.4.1. Prerequisites
6.4.2. The effect of a tutoring program
6.4.3. A permutation test
6.4.4. Conclusion
6.4.5. Comparing Summer and Winter Olympic athletes
6.4.6. The test
6.4.7. Conclusion
6.5. Exercises
7. Quantifying Uncertainty
7.1. Order Statistics
7.1.1. Prerequisites
7.1.2. The flights data frame
7.1.3. median
7.1.4. min and max
7.2. Percentiles
7.2.1. Prerequisites
7.2.2. The finals tibble
7.2.3. The quantile function
7.2.4. Quartiles
7.2.5. Combining two percentiles
7.2.6. Advantages of percentiles
7.3. Resampling
7.3.1. Prerequisites
7.3.2. Population parameter: the median time spent in the air
7.3.3. First try: A mechanical sample
7.3.4. Resampling the sample mean
7.3.5. Distribution of the sample mean
7.3.6. Did it capture the parameter?
7.3.7. Second try: A random sample
7.3.8. Distribution of the sample mean (revisited)
7.3.9. Lucky try?
7.3.10. Resampling round-up
7.4. Confidence Intervals
7.4.1. Prerequisites
7.4.2. Estimating a population proportion
7.4.3. Levels of uncertainty: 80% and 99% confidence intervals
7.4.4. Confidence intervals as a hypothesis test
7.4.5. Final remarks: resampling with care
7.5. Exercises
8. Towards Normality
8.1. Standard Deviation
8.1.1. Prerequisites
8.1.2. Definition of standard deviation
8.1.3. Example: exam scores
8.1.4. Sample standard deviation
8.1.5. The sd function
8.2. More on Standard Deviation
8.2.1. Prerequisites
8.2.2. Working with SD
8.2.3. Standard units
8.2.4. Example: judging a contest
8.2.5. Be careful with summary statistics!
8.3. The Normal Curve
8.3.1. Prerequisites
8.3.2. The standard normal curve
8.3.3. Area under the curve
8.3.4. Normality in real data
8.3.5. Athlete heights and the dnorm function
8.4. Central Limit Theorem
8.4.1. Prerequisites
8.4.2. Example: Net allotments from a clumsy clerk
8.4.3. Central Limit Theorem
8.4.4. Comparing average systolic blood pressure
8.5. Exercises
9. Regression
9.1. Correlation
9.1.1. Prerequisites
9.1.2. Visualizing correlation with a scatter plot
9.1.3. The correlation coefficient r
9.1.4. Technical considerations
9.1.5. Be careful with summary statistics! (revisited)
9.2. Linear Regression
9.2.1. Prerequisites
9.2.2. The trees data frame
9.3. First Approach: Nearest Neighbors Regression
9.3.1. The simple linear regression model
9.3.2. The regression line in standard units
9.3.3. Equation of the regression line
9.3.4. The line of least squares
9.3.5. Numerical optimization
9.3.6. Computing a regression line using base R
9.4. Using Linear Regression
9.4.1. Prerequisites
9.4.2. Palmer Station penguins
9.4.3. Tidy linear regression
9.4.4. Including multiple predictors
9.4.5. Curse of confounding variables
9.5. Regression and Inference
9.5.1. Prerequisites
9.5.2. Assumptions of the regression model
9.5.3. Making predictions about unknown observations
9.5.4. Resampling a confidence interval
9.5.5. How significant is the slope?
9.6. Graphical Diagnostics
9.6.1. Prerequisites
9.6.2. A reminder on assumptions
9.6.3. Some instructive examples
9.6.4. The residual plot
9.6.5. Detecting lack of homoskedasticity
9.6.6. Detecting nonlinearity
9.6.7. What to do from here?
9.7. Exercises
10. Text Analysis
10.1. Tidy Text
10.1.1. Prerequisites
10.1.2. Downloading texts using gutenbergr
10.1.3. Tokens and the principle of tidy text
10.1.4. Stopwords
10.1.5. Tidy text and non-tidy forms
10.2. Frequency Analysis
10.2.1. Prerequisites
10.2.2. An oeuvre of Melville’s prose
10.2.3. Visualizing popular words
10.2.4. Just how popular was Moby Dick’s vocabulary?
10.3. Topic Modeling
10.3.1. Prerequisites
10.3.2. Melville in perspective: the American Renaissance
10.3.3. Preparation for topic modeling
10.3.4. Creating a three-topic model
10.3.5. A bit of LDA vocabulary
10.3.6. Visualizing top per-word probabilities
10.3.7. Where the model goes wrong: per-document misclassifications
10.3.8. The glue: Digital Humanities
10.3.9. Further reading
10.4. Exercises
Index