R in Action: Data analysis and graphics with R and Tidyverse

R is the most powerful tool you can use for statistical analysis. This definitive guide smooths R’s steep learning curve with practical solutions and real-world applications for commercial environments.

In R in Action, Third Edition you will learn how to:
• Set up and install R and RStudio
• Clean, manage, and analyze data with R
• Use the ggplot2 package for graphs and visualizations
• Solve data management problems using R functions
• Fit and interpret regression models
• Test hypotheses and estimate confidence
• Simplify complex multivariate data with principal components and exploratory factor analysis
• Make predictions using time series forecasting
• Create dynamic reports and stunning visualizations
• Debug programs and create packages

R in Action, Third Edition makes learning R quick and easy. That’s why thousands of data scientists have chosen this guide to help them master the powerful language. Far from being a dry academic tome, every example you’ll encounter in this book is relevant to scientific and business developers, and helps you solve common data challenges. R expert Rob Kabacoff takes you on a crash course in statistics, from dealing with messy and incomplete data to creating stunning visualizations. This revised and expanded third edition contains fresh coverage of the new tidyverse approach to data analysis and R’s state-of-the-art graphing capabilities with the ggplot2 package.

About the technology
Used daily by data scientists, researchers, and quants of all types, R is the gold standard for statistical data analysis. This free and open source language includes packages for everything from advanced data visualization to deep learning. Instantly comfortable for mathematically minded users, R easily handles practical problems without forcing you to think like a software engineer.

About the book
R in Action, Third Edition teaches you how to do statistical analysis and data visualization using R and its popular tidyverse packages. In it, you’ll investigate real-world data challenges, including forecasting, data mining, and dynamic report writing. This revised third edition adds new coverage for graphing with ggplot2, along with examples for machine learning topics like clustering, classification, and time series analysis.

What's inside
• Clean, manage, and analyze data
• Use the ggplot2 package for graphs and visualizations
• Techniques for debugging programs and creating packages
• A complete learning resource for R and tidyverse

About the reader
Requires basic math and statistics. No prior experience with R needed.

About the author
Dr. Robert I. Kabacoff is a professor of quantitative analytics at Wesleyan University and a seasoned data scientist with more than 20 years of experience.

Author(s): Robert I. Kabacoff
Edition: 3
Publisher: Manning
Year: 2022

Language: English
Commentary: Publisher's PDF
Pages: 656
City: Shelter Island, NY
Tags: Data Analysis; Regression; Programming; Classification; Principal Component Analysis; Data Visualization; R; Statistics; Data Cleaning; Analysis of Variance; Time Series Analysis; Reporting; Data Management; Missing Data; Generalized Linear Models; Cluster Analysis

R in Action, Third Edition
brief contents
contents
preface
acknowledgments
about this book
What's new in the third edition
Who should read this book
How this book is organized: A road map
Advice for data miners
About the code
liveBook discussion forum
about the author
about the cover illustration
Part 1 Getting started
1 Introduction to R
1.1 Why use R?
1.2 Obtaining and installing R
1.3 Working with R
1.3.1 Getting started
1.3.2 Using RStudio
1.3.3 Getting help
1.3.4 The workspace
1.3.5 Projects
1.4 Packages
1.4.1 What are packages?
1.4.2 Installing a package
1.4.3 Loading a package
1.4.4 Learning about a package
1.5 Using output as input: Reusing results
1.6 Working with large datasets
1.7 Working through an example
Summary
2 Creating a dataset
2.1 Understanding datasets
2.2 Data structures
2.2.1 Vectors
2.2.2 Matrices
2.2.3 Arrays
2.2.4 Data frames
2.2.5 Factors
2.2.6 Lists
2.2.7 Tibbles
2.3 Data input
2.3.1 Entering data from the keyboard
2.3.2 Importing data from a delimited text file
2.3.3 Importing data from Excel
2.3.4 Importing data from JSON
2.3.5 Importing data from the web
2.3.6 Importing data from SPSS
2.3.7 Importing data from SAS
2.3.8 Importing data from Stata
2.3.9 Accessing database management systems
2.3.10 Importing data via Stat/Transfer
2.4 Annotating datasets
2.4.1 Variable labels
2.4.2 Value labels
2.5 Useful functions for working with data objects
Summary
3 Basic data management
3.1 A working example
3.2 Creating new variables
3.3 Recoding variables
3.4 Renaming variables
3.5 Missing values
3.5.1 Recoding values to missing
3.5.2 Excluding missing values from analyses
3.6 Date values
3.6.1 Converting dates to character variables
3.6.2 Going further
3.7 Type conversions
3.8 Sorting data
3.9 Merging datasets
3.9.1 Adding columns to a data frame
3.9.2 Adding rows to a data frame
3.10 Subsetting datasets
3.10.1 Selecting variables
3.10.2 Dropping variables
3.10.3 Selecting observations
3.10.4 The subset() function
3.10.5 Random samples
3.11 Using dplyr to manipulate data frames
3.11.1 Basic dplyr functions
3.11.2 Using pipe operators to chain statements
3.12 Using SQL statements to manipulate data frames
Summary
4 Getting started with graphs
4.1 Creating a graph with ggplot2
4.1.1 ggplot
4.1.2 Geoms
4.1.3 Grouping
4.1.4 Scales
4.1.5 Facets
4.1.6 Labels
4.1.7 Themes
4.2 ggplot2 details
4.2.1 Placing the data and mapping options
4.2.2 Graphs as objects
4.2.3 Saving graphs
4.2.4 Common mistakes
Summary
5 Advanced data management
5.1 A data management challenge
5.2 Numerical and character functions
5.2.1 Mathematical functions
5.2.2 Statistical functions
5.2.3 Probability functions
5.2.4 Character functions
5.2.5 Other useful functions
5.2.6 Applying functions to matrices and data frames
5.2.7 A solution for the data management challenge
5.3 Control flow
5.3.1 Repetition and looping
5.3.2 Conditional execution
5.4 User-written functions
5.5 Reshaping data
5.5.1 Transposing
5.5.2 Converting from wide to long dataset formats
5.6 Aggregating data
Summary
Part 2 Basic methods
6 Basic graphs
6.1 Bar charts
6.1.1 Simple bar charts
6.1.2 Stacked, grouped, and filled bar charts
6.1.3 Mean bar charts
6.1.4 Tweaking bar charts
6.2 Pie charts
6.3 Tree maps
6.4 Histograms
6.5 Kernel density plots
6.6 Box plots
6.6.1 Using parallel box plots to compare groups
6.6.2 Violin plots
6.7 Dot plots
Summary
7 Basic statistics
7.1 Descriptive statistics
7.1.1 A menagerie of methods
7.1.2 Even more methods
7.1.3 Descriptive statistics by group
7.1.4 Summarizing data interactively with dplyr
7.1.5 Visualizing results
7.2 Frequency and contingency tables
7.2.1 Generating frequency tables
7.2.2 Tests of independence
7.2.3 Measures of association
7.2.4 Visualizing results
7.3 Correlations
7.3.1 Types of correlations
7.3.2 Testing correlations for significance
7.3.3 Visualizing correlations
7.4 T-tests
7.4.1 Independent t-test
7.4.2 Dependent t-test
7.4.3 When there are more than two groups
7.5 Nonparametric tests of group differences
7.5.1 Comparing two groups
7.5.2 Comparing more than two groups
7.6 Visualizing group differences
Summary
Part 3 Intermediate methods
8 Regression
8.1 The many faces of regression
8.1.1 Scenarios for using OLS regression
8.1.2 What you need to know
8.2 OLS regression
8.2.1 Fitting regression models with lm()
8.2.2 Simple linear regression
8.2.3 Polynomial regression
8.2.4 Multiple linear regression
8.2.5 Multiple linear regression with interactions
8.3 Regression diagnostics
8.3.1 A typical approach
8.3.2 An enhanced approach
8.3.3 Multicollinearity
8.4 Unusual observations
8.4.1 Outliers
8.4.2 High-leverage points
8.4.3 Influential observations
8.5 Corrective measures
8.5.1 Deleting observations
8.5.2 Transforming variables
8.5.3 Adding or deleting variables
8.5.4 Trying a different approach
8.6 Selecting the “best” regression model
8.6.1 Comparing models
8.6.2 Variable selection
8.7 Taking the analysis further
8.7.1 Cross-validation
8.7.2 Relative importance
Summary
9 Analysis of variance
9.1 A crash course on terminology
9.2 Fitting ANOVA models
9.2.1 The aov() function
9.2.2 The order of formula terms
9.3 One-way ANOVA
9.3.1 Multiple comparisons
9.3.2 Assessing test assumptions
9.4 One-way ANCOVA
9.4.1 Assessing test assumptions
9.4.2 Visualizing the results
9.5 Two-way factorial ANOVA
9.6 Repeated measures ANOVA
9.7 Multivariate analysis of variance (MANOVA)
9.7.1 Assessing test assumptions
9.7.2 Robust MANOVA
9.8 ANOVA as regression
Summary
10 Power analysis
10.1 A quick review of hypothesis testing
10.2 Implementing power analysis with the pwr package
10.2.1 T-tests
10.2.2 ANOVA
10.2.3 Correlations
10.2.4 Linear models
10.2.5 Tests of proportions
10.2.6 Chi-square tests
10.2.7 Choosing an appropriate effect size in novel situations
10.3 Creating power analysis plots
10.4 Other packages
Summary
11 Intermediate graphs
11.1 Scatter plots
11.1.1 Scatter plot matrices
11.1.2 High-density scatter plots
11.1.3 3D scatter plots
11.1.4 Spinning 3D scatter plots
11.1.5 Bubble plots
11.2 Line charts
11.3 Corrgrams
11.4 Mosaic plots
Summary
12 Resampling statistics and bootstrapping
12.1 Permutation tests
12.2 Permutation tests with the coin package
12.2.1 Independent two-sample and k-sample tests
12.2.2 Independence in contingency tables
12.2.3 Independence between numeric variables
12.2.4 Dependent two-sample and k-sample tests
12.2.5 Going further
12.3 Permutation tests with the lmPerm package
12.3.1 Simple and polynomial regression
12.3.2 Multiple regression
12.3.3 One-way ANOVA and ANCOVA
12.3.4 Two-way ANOVA
12.4 Additional comments on permutation tests
12.5 Bootstrapping
12.6 Bootstrapping with the boot package
12.6.1 Bootstrapping a single statistic
12.6.2 Bootstrapping several statistics
Summary
Part 4 Advanced methods
13 Generalized linear models
13.1 Generalized linear models and the glm() function
13.1.1 The glm() function
13.1.2 Supporting functions
13.1.3 Model fit and regression diagnostics
13.2 Logistic regression
13.2.1 Interpreting the model parameters
13.2.2 Assessing the impact of predictors on the probability of an outcome
13.2.3 Overdispersion
13.2.4 Extensions
13.3 Poisson regression
13.3.1 Interpreting the model parameters
13.3.2 Overdispersion
13.3.3 Extensions
Summary
14 Principal components and factor analysis
14.1 Principal components and factor analysis in R
14.2 Principal components
14.2.1 Selecting the number of components to extract
14.2.2 Extracting principal components
14.2.3 Rotating principal components
14.2.4 Obtaining principal component scores
14.3 Exploratory factor analysis
14.3.1 Deciding how many common factors to extract
14.3.2 Extracting common factors
14.3.3 Rotating factors
14.3.4 Factor scores
14.3.5 Other EFA-related packages
14.4 Other latent variable models
Summary
15 Time series
15.1 Creating a time-series object in R
15.2 Smoothing and seasonal decomposition
15.2.1 Smoothing with simple moving averages
15.2.2 Seasonal decomposition
15.3 Exponential forecasting models
15.3.1 Simple exponential smoothing
15.3.2 Holt and Holt–Winters exponential smoothing
15.3.3 The ets() function and automated forecasting
15.4 ARIMA forecasting models
15.4.1 Prerequisite concepts
15.4.2 ARMA and ARIMA models
15.4.3 Automated ARIMA forecasting
15.5 Going further
Summary
16 Cluster analysis
16.1 Common steps in cluster analysis
16.2 Calculating distances
16.3 Hierarchical cluster analysis
16.4 Partitioning-cluster analysis
16.4.1 K-means clustering
16.4.2 Partitioning around medoids
16.5 Avoiding nonexistent clusters
16.6 Going further
Summary
17 Classification
17.1 Preparing the data
17.2 Logistic regression
17.3 Decision trees
17.3.1 Classical decision trees
17.3.2 Conditional inference trees
17.4 Random forests
17.5 Support vector machines
17.5.1 Tuning an SVM
17.6 Choosing a best predictive solution
17.7 Understanding black box predictions
17.7.1 Break-down plots
17.7.2 Plotting Shapley values
17.8 Going further
Summary
18 Advanced methods for missing data
18.1 Steps in dealing with missing data
18.2 Identifying missing values
18.3 Exploring missing-values patterns
18.3.1 Visualizing missing values
18.3.2 Using correlations to explore missing values
18.4 Understanding the sources and impact of missing data
18.5 Rational approaches for dealing with incomplete data
18.6 Deleting missing data
18.6.1 Complete-case analysis (listwise deletion)
18.6.2 Available case analysis (pairwise deletion)
18.7 Single imputation
18.7.1 Simple imputation
18.7.2 K-nearest neighbor imputation
18.7.3 missForest
18.8 Multiple imputation
18.9 Other approaches to missing data
Summary
Part 5 Expanding your skills
19 Advanced graphs
19.1 Modifying scales
19.1.1 Customizing axes
19.1.2 Customizing colors
19.2 Modifying themes
19.2.1 Prepackaged themes
19.2.2 Customizing fonts
19.2.3 Customizing legends
19.2.4 Customizing the plot area
19.3 Adding annotations
19.4 Combining graphs
19.5 Making graphs interactive
Summary
20 Advanced programming
20.1 A review of the language
20.1.1 Data types
20.1.2 Control structures
20.1.3 Creating functions
20.2 Working with environments
20.3 Non-standard evaluation
20.4 Object-oriented programming
20.4.1 Generic functions
20.4.2 Limitations of the S3 model
20.5 Writing efficient code
20.5.1 Efficient data input
20.5.2 Vectorization
20.5.3 Correctly sizing objects
20.5.4 Parallelization
20.6 Debugging
20.6.1 Common sources of errors
20.6.2 Debugging tools
20.6.3 Session options that support debugging
20.6.4 Using RStudio’s visual debugger
20.7 Going further
Summary
21 Creating dynamic reports
21.1 A template approach to reports
21.2 Creating a report with R and R Markdown
21.3 Creating a report with R and LaTeX
21.3.1 Creating a parameterized report
21.4 Avoiding common R Markdown problems
21.5 Going further
Summary
22 Creating a package
22.1 The edatools package
22.2 Creating a package
22.2.1 Installing development tools
22.2.2 Creating a package project
22.2.3 Writing the package functions
22.2.4 Adding function documentation
22.2.5 Adding a general help file (optional)
22.2.6 Adding sample data to the package (optional)
22.2.7 Adding a vignette (optional)
22.2.8 Editing the DESCRIPTION file
22.2.9 Building and installing the package
22.3 Sharing your package
22.3.1 Distributing a source package file
22.3.2 Submitting to CRAN
22.3.3 Hosting on GitHub
22.3.4 Creating a package website
22.4 Going further
Summary
afterword Into the rabbit hole
appendix A Graphical user interfaces
appendix B Customizing the startup environment
appendix C Exporting data from R
C.1 Delimited text file
C.2 Excel spreadsheet
C.3 Statistical applications
appendix D Matrix algebra in R
appendix E Packages used in this book
appendix F Working with large datasets
F.1 Efficient programming
F.2 Storing data outside of RAM
F.3 Analytic packages for out-of-memory data
F.4 Comprehensive solutions for working with enormous datasets
appendix G Updating an R installation
G.1 Automated installation (Windows only)
G.2 Manual installation (Windows and macOS)
G.3 Updating an R installation (Linux)
references
index