A start-to-finish guide to one of the most useful programming languages for researchers in a variety of fields
In the newly revised Third Edition of The R Book, a team of distinguished teachers and researchers delivers a user-friendly and comprehensive discussion of foundational and advanced topics in the R software language, which is used widely in science, engineering, medicine, economics, and other fields. The book is designed to be used both as a complete text, readable from cover to cover, and as a reference manual for practitioners seeking authoritative guidance on particular topics.
This latest edition offers instruction on the use of the RStudio GUI, an easy-to-use environment for those new to R. It provides readers with a complete walkthrough of the R language, beginning at a point that assumes no prior knowledge of R and very little previous knowledge of statistics. Readers will also find:
• A thorough introduction to fundamental concepts in statistics and step-by-step roadmaps to their implementation in R;
• Comprehensive explorations of worked examples in R;
• A complementary companion website with downloadable datasets that are used in the book;
• In-depth examination of essential R packages.
Perfect for undergraduate and postgraduate students of science, engineering, medicine, economics, and geography, The R Book will also earn a place in the libraries of social sciences professionals.
Author(s): Elinor Jones, Simon Harden, Michael J. Crawley
Edition: 3
Publisher: Wiley
Year: 2022
Language: English
Commentary: Vector PDF
Pages: 880
City: Hoboken, NJ
Tags: Probabilistic Models; Regression; R; Statistics; Hypothesis Testing; Survival Analysis; Time Series Analysis; Generalized Additive Models
Cover
Title Page
Copyright
Contents
List of Tables
Preface
Acknowledgments
About the Companion Website
Chapter 1 Getting Started
1.1 Navigating the book
1.1.1 How to use this book
1.2 R vs. RStudio
1.3 Installing R and RStudio
1.4 Using RStudio
1.4.1 Using R directly via the console
1.4.2 Using text editors
1.5 The Comprehensive R Archive Network
1.5.1 Manuals
1.5.2 Frequently asked questions
1.5.3 Contributed documentation
1.6 Packages in R
1.6.1 Contents of packages
1.6.2 Finding packages
1.6.3 Installing packages
1.7 Getting help in R
1.7.1 Worked examples of functions
1.7.2 Demonstrations of R functions
1.8 Good housekeeping
1.8.1 Variable types
1.8.2 What's loaded or defined in the current session
1.8.3 Attaching and detaching objects
1.8.4 Projects
1.9 Linking to other computer languages
1.9 References
Chapter 2 Technical Background
2.1 Mathematical functions
2.1.1 Logarithms and exponentials
2.1.2 Trigonometric functions
2.1.3 Power laws
2.1.4 Polynomial functions
2.1.5 Gamma function
2.1.6 Asymptotic functions
2.1.7 Sigmoid (S‐shaped) functions
2.1.8 Biexponential function
2.1.9 Transformations of model variables
2.2 Matrices
2.2.1 Matrix multiplication
2.2.2 Diagonals of matrices
2.2.3 Determinants
2.2.4 Inverse of a matrix
2.2.5 Eigenvalues and eigenvectors
2.2.6 Solving systems of linear equations using matrices
2.3 Calculus
2.3.1 Differentiation
2.3.2 Integration
2.3.3 Differential equations
2.4 Probability
2.4.1 The central limit theorem
2.4.2 Conditional probability
2.5 Statistics
2.5.1 Least squares
2.5.2 Maximum likelihood
2.5 Reference
Chapter 3 Essentials of the R Language
3.1 Calculations
3.1.1 Complex numbers
3.1.2 Rounding
3.1.3 Arithmetic
3.1.4 Modular arithmetic
3.1.5 Operators
3.1.6 Integers
3.2 Naming objects
3.3 Factors
3.4 Logical operations
3.4.1 TRUE, T, FALSE, F
3.4.2 Testing for equality of real numbers
3.4.3 Testing for equality of non‐numeric objects
3.4.4 Evaluation of combinations of TRUE and FALSE
3.4.5 Logical arithmetic
3.5 Generating sequences
3.5.1 Generating repeats
3.5.2 Generating factor levels
3.6 Class membership
3.7 Missing values, infinity, and things that are not numbers
3.7.1 Missing values: NA
3.8 Vectors and subscripts
3.8.1 Extracting elements of a vector using subscripts
3.8.2 Classes of vector
3.8.3 Naming elements within vectors
3.9 Working with logical subscripts
3.10 Vector functions
3.10.1 Obtaining tables using tapply()
3.10.2 Applying functions to vectors using sapply()
3.10.3 The aggregate() function for grouped summary statistics
3.10.4 Parallel minima and maxima: pmin and pmax
3.10.5 Finding closest values
3.10.6 Sorting, ranking, and ordering
3.10.7 Understanding the difference between unique() and duplicated()
3.10.8 Looking for runs of numbers within vectors
3.10.9 Sets: union(), intersect(), and setdiff()
3.11 Matrices and arrays
3.11.1 Matrices
3.11.2 Naming the rows and columns of matrices
3.11.3 Calculations on rows or columns of matrices
3.11.4 Adding rows and columns to matrices
3.11.5 The sweep() function
3.11.6 Applying functions to matrices
3.11.7 Scaling a matrix
3.11.8 Using the max.col() function
3.11.9 Restructuring a multi‐dimensional array using aperm()
3.12 Random numbers, sampling, and shuffling
3.12.1 The sample() function
3.13 Loops and repeats
3.13.1 More complicated while () loops
3.13.2 Loop avoidance
3.13.3 The slowness of loops
3.13.4 Do not ‘grow’ data sets by concatenation or recursive function calls
3.13.5 Loops for producing time series
3.14 Lists
3.14.1 Summarising lists and lapply()
3.14.2 Manipulating and saving lists
3.15 Text, character strings, and pattern matching
3.15.1 Pasting character strings together
3.15.2 Extracting parts of strings
3.15.3 Counting things within strings
3.15.4 Upper and lower case text
3.15.5 The match() function and relational databases
3.15.6 Pattern matching
3.15.7 Substituting text within character strings
3.15.8 Locations of a pattern within a vector
3.15.9 Comparing vectors using %in% and which()
3.15.10 Stripping patterned text out of complex strings
3.16 Dates and times in R
3.16.1 Reading time data from files
3.16.2 Calculations with dates and times
3.16.3 Generating sequences of dates
3.16.4 Calculating time differences between the rows of a dataframe
3.16.5 Regression using dates and times
3.17 Environments
3.17.1 Using attach() or not!
3.17.2 Using attach() in this book
3.18 Writing R functions
3.18.1 Arithmetic mean of a single sample
3.18.2 Median of a single sample
3.18.3 Geometric mean
3.18.4 Harmonic mean
3.18.5 Variance
3.18.6 Variance ratio test
3.18.7 Using the variance
3.18.8 Plots and deparsing in functions
3.18.9 The switch() function
3.18.10 Arguments in our function
3.18.11 Errors in our functions
3.18.12 Outputs from our function
3.19 Structure of R objects
3.20 Writing from R to a file
3.20.1 Saving data objects
3.20.2 Saving command history
3.20.3 Saving graphics or plots
3.20.4 Saving data for a spreadsheet
3.20.5 Saving output from functions to a file
3.21 Tips for writing R code
3.21 References
Chapter 4 Data Input and Dataframes
4.1 Working directory
4.2 Data input from files
4.2.1 Data input using read.table() and read.csv()
4.2.2 Input from files using scan()
4.2.3 Reading data from a file using readLines()
4.3 Data input directly from the web
4.4 Built‐in data files
4.5 Dataframes
4.5.1 Subscripts and indices
4.5.2 Selecting rows from the dataframe at random
4.5.3 Sorting dataframes
4.5.4 Using logical conditions to select rows from the dataframe
4.5.5 Omitting rows containing missing values, NA
4.5.6 A dataframe with row names instead of row numbers
4.5.7 Creating a dataframe from another kind of object
4.5.8 Eliminating duplicate rows from a dataframe
4.5.9 Dates in dataframes
4.6 Using the match() function in dataframes
4.6.1 Merging two dataframes
4.7 Adding margins to a dataframe
4.7.1 Summarising the contents of dataframes
Chapter 5 Graphics
5.1 Plotting principles
5.1.1 Axes labels and titles
5.1.2 Plotting symbols and colours
5.1.3 Saving graphics
5.2 Plots for single variables
5.2.1 Histograms vs. bar charts
5.2.2 Histograms
5.2.3 Density plots
5.2.4 Boxplots
5.2.5 Dotplots
5.2.6 Bar charts
5.2.7 Pie charts
5.3 Plots for showing two numeric variables
5.3.1 Scatterplot
5.3.2 Plots with many identical values
5.4 Plots for numeric variables by group
5.4.1 Boxplots by group
5.4.2 Dotplots by group
5.4.3 An inferior (but popular) option
5.5 Plots showing two categorical variables
5.5.1 Grouped bar charts
5.5.2 Mosaic plots
5.6 Plots for three (or more) variables
5.6.1 Plots of all pairs of variables
5.6.2 Incorporating a third variable on a scatterplot
5.6.3 Basic 3D plots
5.7 Trellis graphics
5.7.1 Panel boxplots
5.7.2 Panel scatterplots
5.7.3 Panel barplots
5.7.4 Panels for conditioning plots
5.7.5 Panel histograms
5.7.6 More panel functions
5.8 Plotting functions
5.8.1 Two‐dimensional plots
5.8.2 Three‐dimensional plots
5.8 References
Chapter 6 Graphics in More Detail
6.1 More on colour
6.1.1 Colour palettes with categorical data
6.1.2 The RColorBrewer package
6.1.3 Foreground colours
6.1.4 Background colours
6.1.5 Background colour for legends
6.1.6 Different colours for different parts of the graph
6.1.7 Full control of colours in plots
6.1.8 Cross‐hatching and grey scale
6.2 Changing the look of graphics
6.2.1 Shape and size of plot
6.2.2 Multiple plots on one screen
6.2.3 Tickmarks and associated labels
6.2.4 Font of text
6.3 Adding items to plots
6.3.1 Adding text
6.3.2 Adding smooth parametric curves to a scatterplot
6.3.3 Fitting non‐parametric curves through a scatterplot
6.3.4 Connecting observations
6.3.5 Adding shapes
6.3.6 Adding mathematical and other symbols
6.4 The grammar of graphics and ggplot2
6.4.1 Basic structure
6.4.2 Examples
6.5 Graphics cheat sheet
6.5.1 Text justification, adj
6.5.2 Annotation of graphs, ann
6.5.3 Delay moving on to the next in a series of plots, ask
6.5.4 Control over the axes, axis
6.5.5 Background colour for plots, bg
6.5.6 Boxes around plots, bty
6.5.7 Size of plotting symbols using the character expansion function, cex
6.5.8 Changing the shape of the plotting region, plt
6.5.9 Locating multiple graphs in non‐standard layouts using fig
6.5.10 Two graphs with a common X scale but different Y scales using fig
6.5.11 The layout function
6.5.12 Creating and controlling multiple screens on a single device
6.5.13 Orientation of numbers on the tick marks, las
6.5.14 Shapes for the ends and joins of lines, lend and ljoin
6.5.15 Line types, lty
6.5.16 Line widths, lwd
6.5.17 Several graphs on the same page, mfrow and mfcol
6.5.18 Margins around the plotting area, mar
6.5.19 Plotting more than one graph on the same axes, new
6.5.20 Outer margins, oma
6.5.21 Packing graphs closer together
6.5.22 Square plotting region, pty
6.5.23 Character rotation, srt
6.5.24 Rotating the axis labels
6.5.25 Tick marks on the axes
6.5.26 Axis styles
6.5.27 Summary
6.5 References
Chapter 7 Tables
7.1 Tabulating categorical or discrete data
7.1.1 Tables of counts
7.1.2 Tables of proportions
7.2 Tabulating summaries of numeric data
7.2.1 General summaries by group
7.2.2 Bespoke summaries by group
7.3 Converting between tables and dataframes
7.3.1 From a table to a dataframe
7.3.2 From a dataframe to a table
7.3 Reference
Chapter 8 Probability Distributions in R
8.1 Probability distributions: the basics
8.1.1 Discrete and continuous probability distributions
8.1.2 Describing probability distributions mathematically
8.1.3 Independence
8.2 Probability distributions in R
8.3 Continuous probability distributions
8.3.1 The Normal (or Gaussian) distribution
8.3.2 The Uniform distribution
8.3.3 The Chi‐squared distribution
8.3.4 The F distribution
8.3.5 Student's t distribution
8.3.6 The Gamma distribution
8.3.7 The Exponential distribution
8.3.8 The Beta distribution
8.3.9 The Lognormal distribution
8.3.10 The Logistic distribution
8.3.11 The Weibull distribution
8.3.12 Multivariate Normal distribution
8.4 Discrete probability distributions
8.4.1 The Bernoulli distribution
8.4.2 The Binomial distribution
8.4.3 The Geometric distribution
8.4.4 The Hypergeometric distribution
8.4.5 The Multinomial distribution
8.4.6 The Poisson distribution
8.4.7 The Negative Binomial distribution
8.5 The central limit theorem
8.5 References
Chapter 9 Testing
9.1 Principles
9.1.1 Defining the question to be tested
9.1.2 Assumptions
9.1.3 Interpreting results
9.2 Continuous data
9.2.1 Single population average
9.2.2 Two population averages
9.2.3 Multiple population averages
9.2.4 Population distribution
9.2.5 Checking and testing for normality
9.2.6 Comparing variances
9.3 Discrete and categorical data
9.3.1 Sign test
9.3.2 Test to compare proportions
9.3.3 Contingency tables
9.3.4 Testing contingency tables
9.4 Bootstrapping
9.5 Multiple tests
9.6 Power and sample size calculations
9.7 A table of tests
9.7 References
Chapter 10 Regression
10.1 The simple linear regression model
10.1.1 Model format and assumptions
10.1.2 Building a simple linear regression model
10.2 The multiple linear regression model
10.2.1 Model format and assumptions
10.2.2 Building a multiple linear regression model
10.2.3 Categorical covariates
10.2.4 Interactions between covariates
10.3 Understanding the output
10.3.1 Residuals
10.3.2 Estimates of coefficients
10.3.3 Testing individual coefficients
10.3.4 Residual standard error
10.3.5 R² and its variants
10.3.6 The regression F‐test
10.3.7 ANOVA: Same model, different output
10.3.8 Extracting model information
10.4 Fitting models
10.4.1 The principle of parsimony
10.4.2 First plot the data
10.4.3 Comparing nested models
10.4.4 Comparing non‐nested models
10.4.5 Dealing with large numbers of covariates
10.5 Checking model assumptions
10.5.1 Residuals and standardised residuals
10.5.2 Checking for linearity
10.5.3 Checking for homoscedasticity of errors
10.5.4 Checking for normality of errors
10.5.5 Checking for independence of errors
10.5.6 Checking for influential observations
10.5.7 Checking for collinearity
10.5.8 Improving fit
10.6 Using the model
10.6.1 Interpretation of model
10.6.2 Making predictions
10.7 Further types of regression modelling
10.7 References
Chapter 11 Generalised Linear Models
11.1 How GLMs work
11.1.1 Error structure
11.1.2 Linear predictor
11.1.3 Link function
11.1.4 Model checking
11.1.5 Interpretation and prediction
11.2 Count data and GLMs
11.2.1 A straightforward example
11.2.2 Dispersion
11.2.3 An alternative to Poisson counts
11.3 Count table data and GLMs
11.3.1 Log‐linear models
11.3.2 All covariates might be useful
11.3.3 Spine plot
11.4 Proportion data and GLMs
11.4.1 Theoretical background
11.4.2 Logistic regression with binomial errors
11.4.3 Predicting x from y
11.4.4 Proportion data with categorical explanatory variables
11.4.5 Binomial GLM with ordered categorical covariates
11.4.6 Binomial GLM with categorical and continuous covariates
11.4.7 Revisiting lizards
11.5 Binary Response Variables and GLMs
11.5.1 A straightforward example
11.5.2 Graphical tests of the fit of the logistic curve to data
11.5.3 Mixed covariate types with a binary response
11.5.4 Spine plot and logistic regression
11.6 Bootstrapping a GLM
11.6 References
Chapter 12 Generalised Additive Models
12.1 Smoothing example
12.2 Straightforward examples of GAMs
12.3 Background to using GAMs
12.3.1 Smoothing
12.3.2 Suggestions for using gam()
12.4 More complex GAM examples
12.4.1 Back to Ozone
12.4.2 An example with strongly humped data
12.4.3 GAMs with binary data
12.4.4 Three‐dimensional graphic output from gam
12.4 References
Chapter 13 Mixed‐Effect Models
13.1 Regression with categorical covariates
13.2 An alternative method: random effects
13.3 Common data structures where random effects are useful
13.3.1 Nested (hierarchical) structures
13.3.2 Non‐nested structures
13.3.3 Longitudinal structures
13.4 R packages to deal with mixed effects models
13.4.1 The nlme package
13.4.2 The lme4 package
13.4.3 Methods for fitting mixed models
13.5 Examples of implementing random effect models
13.5.1 Multilevel data (two levels)
13.5.2 Multilevel data (three levels)
13.5.3 Designed experiment: split‐plot
13.5.4 Longitudinal data
13.6 Generalised linear mixed models
13.6.1 Logistic mixed model
13.7 Alternatives to mixed models
13.7 References
Chapter 14 Non‐linear Regression
14.1 Example: modelling deer jaw bone length
14.1.1 An exponential model for the deer data
14.1.2 A Michaelis–Menten model for the deer data
14.1.3 Comparison of the exponential and the Michaelis–Menten model
14.2 Example: grouped data
14.3 Self‐starting functions
14.3.1 Self‐starting Michaelis–Menten model
14.3.2 Self‐starting asymptotic exponential model
14.3.3 Self‐starting logistic
14.3.4 Self‐starting four‐parameter logistic
14.4 Further considerations
14.4.1 Model checking
14.4.2 Confidence intervals
14.4 References
Chapter 15 Survival Analysis
15.1 Handling survival data
15.1.1 Structure of a survival dataset
15.1.2 Survival data in R
15.2 The survival and hazard functions
15.2.1 Non‐parametric estimation of the survival function
15.2.2 Parametric estimation of the survival function
15.3 Modelling survival data
15.3.1 The data
15.3.2 The Cox proportional hazard model
15.3.3 Accelerated failure time models
15.3.4 Cox proportional hazard or a parametric model?
15.3 References
Chapter 16 Designed Experiments
16.1 Factorial experiments
16.1.1 Expanding data
16.2 Pseudo‐replication
16.2.1 Split‐plot effects
16.2.2 Removing pseudo‐replication
16.2.3 Derived variable analysis
16.3 Contrasts
16.3.1 Contrast coefficients
16.3.2 An example of contrasts using R
16.3.3 Model simplification for contrasts
16.3.4 Helmert contrasts
16.3.5 Sum contrasts
16.3.6 Polynomial contrasts
16.3.7 Contrasts with multiple covariates
16.3 References
Chapter 17 Meta‐Analysis
17.1 Elements of a meta‐analysis
17.1.1 Choosing studies for a meta‐analysis
17.1.2 Effects and effect size
17.1.3 Weights
17.1.4 Fixed vs. random effect models
17.2 Meta‐analysis in R
17.2.1 Formatting information from studies
17.2.2 Computing the inputs of a meta‐analysis
17.2.3 Conducting the meta‐analysis
17.3 Examples
17.3.1 Meta‐analysis of scaled differences
17.4 Meta‐analysis of categorical data
17.4 References
Chapter 18 Time Series
18.1 Moving average
18.2 Blowflies
18.3 Seasonal data
18.3.1 Point of view
18.3.2 Built‐in ts() functions
18.3.3 Cycles
18.3.4 Testing for a time series trend
18.4 Multiple time series
18.5 Some theoretical background
18.5.1 Autocorrelation
18.5.2 Autoregressive models
18.5.3 Partial autocorrelation
18.5.4 Moving average models
18.5.5 More general models: ARMA and ARIMA
18.6 ARIMA example
18.7 Simulation of time series
18.7 Reference
Chapter 19 Multivariate Statistics
19.1 Visualising data
19.2 Multivariate analysis of variance
19.3 Principal component analysis
19.4 Factor analysis
19.5 Cluster analysis
19.5.1 k‐means
19.6 Hierarchical cluster analysis
19.7 Discriminant analysis
19.8 Neural networks
19.8 References
Chapter 20 Classification and Regression Trees
20.1 How CARTs work
20.2 Regression trees
20.2.1 The tree package
20.2.2 The rpart package
20.2.3 Comparison with linear regression
20.2.4 Model simplification
20.3 Classification trees
20.3.1 Classification trees with categorical explanatory variables
20.3.2 Classification trees for replicated data
20.4 Looking for patterns
20.4 References
Chapter 21 Spatial Statistics
21.1 Spatial point processes
21.1.1 How can we check for randomness?
21.1.2 Models
21.1.3 Marks
21.2 Geospatial statistics
21.2.1 Models
21.2 References
Chapter 22 Bayesian Statistics
22.1 Components of a Bayesian Analysis
22.1.1 The likelihood (the model and data)
22.1.2 Priors
22.1.3 The Posterior
22.1.4 Markov chain Monte Carlo (MCMC)
22.1.5 Considerations for MCMC
22.1.6 Inference
22.1.7 The Pros and Cons of going Bayesian
22.2 Bayesian analysis in R
22.2.1 Installing JAGS
22.2.2 Running JAGS in R
22.2.3 Writing BUGS models
22.3 Examples
22.3.1 MCMC for a simple linear regression
22.3.2 MCMC for longitudinal data
22.4 MCMC for a model with binomial errors
22.4 References
Chapter 23 Simulation Models
23.1 Temporal dynamics
23.1.1 Chaotic dynamics in population size
23.1.2 Investigating the route to chaos
23.2 Spatial simulation models
23.2.1 Meta‐population dynamics
23.2.2 Coexistence resulting from spatially explicit (local) density dependence
23.2.3 Pattern generation resulting from dynamic interactions
23.3 Temporal and spatial dynamics: random walk
23.3 References
Index
EULA