It seems that most introductory R books spend too much time on correlations and other modeling topics. I am still hoping to find an R book that deals primarily with data manipulation and descriptive graphics at an intro-to-intermediate level. Simply put, knowing something well and conveying it clearly to your audience are two very different skills.
Author(s): John Maindonald, John Braun
Series: Cambridge Series in Statistical and Probabilistic Mathematics
Edition: 2
Publisher: Cambridge University Press
Year: 2006
Language: English
Pages: 540
Cover......Page 1
Half-title......Page 3
Series-title......Page 6
Title......Page 7
Copyright......Page 8
Dedication......Page 9
Contents......Page 11
Preface......Page 21
Using the console (or command line) window......Page 27
Entry of data at the command line......Page 28
Collection of vectors into a data frame......Page 29
Quitting R......Page 30
R offers an extensive collection of functions......Page 31
1.1.3 Online help......Page 32
Wide-ranging searches......Page 33
1.2.1 Reading data from a file......Page 34
Data sets that accompany R packages......Page 35
1.3.2 Concatenation – joining vector objects......Page 36
1.3.4 Patterned data......Page 37
1.3.5 Missing values......Page 38
1.3.6 Factors......Page 39
1.4 Data frames and matrices......Page 40
Data frames are a specialized type of list......Page 41
Subsets of data frames......Page 42
1.4.3 Data frames and matrices......Page 43
1.5.1 Built-in functions......Page 44
Data summary functions – table( ) and sapply( )......Page 45
1.5.2 Generic functions and the class of an object......Page 46
The structure of functions......Page 47
1.5.4 Relational and logical operators and operations......Page 48
Identification of rows that include missing values......Page 49
1.6 Graphics in R......Page 50
1.6.1 The function plot( ) and allied functions......Page 51
1.6.3 The importance of aspect ratio......Page 53
1.6.5 The plotting of expressions and mathematical symbols......Page 54
1.6.7 Plot methods for objects other than vectors......Page 55
1.6.10 Good and bad graphs......Page 56
Panels of scatterplots – the use of xyplot( )......Page 57
Selected lattice functions......Page 58
Workspace management strategies......Page 59
Cosmetic issues......Page 60
Common sources of difficulty......Page 61
1.10 Further reading......Page 62
1.11 Exercises......Page 63
2.1 Revealing views of the data......Page 69
Histograms and density plots......Page 70
The stem-and-leaf display......Page 72
Boxplots......Page 73
2.1.2 Patterns in univariate time series......Page 74
2.1.3 Patterns in bivariate data......Page 76
What is the appropriate scale?......Page 77
Example: eggs of cuckoos......Page 78
2.1.5 Multiple variables and times......Page 80
2.1.6 Scatterplots, broken down by multiple factors......Page 82
Asymmetry of the distribution......Page 84
2.2 Data summary......Page 85
2.2.1 Counts......Page 86
Addition over one or more margins of a table......Page 87
Cross-tabulation – the xtabs( ) function......Page 88
Summary as a prelude to analysis – aggregate( )......Page 89
The benefits of data summary – dengue status example......Page 91
2.2.3 Standard deviation and inter-quartile range......Page 92
The pooled standard deviation......Page 93
2.2.4 Correlation......Page 94
2.3 Statistical analysis questions, aims and strategies......Page 95
2.3.2 Helpful and unhelpful questions......Page 96
2.3.3 How will results be used?......Page 97
Questionnaires and surveys......Page 98
2.3.6 Planning the formal analysis......Page 99
2.4 Recap......Page 100
2.6 Exercises......Page 101
3 Statistical models......Page 104
3.1.2 Models that include a random component......Page 105
Generalizing from models......Page 106
3.1.3 Fitting models – the model formula......Page 108
3.2 Distributions: models for the random component......Page 109
Binomial distribution......Page 110
Means, variances and standard deviations......Page 111
Normal distribution......Page 112
Other continuous distributions......Page 113
3.3.1 Simulation......Page 114
3.3.2 Sampling from populations......Page 115
3.4 Model assumptions......Page 116
3.4.1 Random sampling assumptions – independence......Page 117
The normal probability plot......Page 118
The sample plot, set alongside plots for random normal data......Page 119
Formal statistical testing for normality?......Page 120
3.4.5 Why models matter – adding across contingency tables......Page 121
3.5 Recap......Page 122
3.7 Exercises......Page 123
Why use the sample mean as an estimator?......Page 127
4.1.3 Assessing accuracy – the standard error......Page 128
4.1.4 The standard error for the difference of means......Page 129
4.1.6 The sampling distribution of the t-statistic......Page 130
Calculations for the t-distribution......Page 132
Confidence intervals of 95% or 99%......Page 133
Tests of hypotheses......Page 134
A summary of one- and two-sample calculations......Page 135
When is pairing helpful?......Page 136
Different ways to report results......Page 137
4.2.3 Confidence intervals for the correlation......Page 139
4.2.4 Confidence intervals versus hypothesis tests......Page 140
The mechanics of the calculation......Page 141
An example where a chi-squared test may not be valid......Page 142
4.3.1 Rare and endangered plant species......Page 143
Examination of departures from a consistent overall row pattern......Page 144
Interpretation issues......Page 145
4.4 One-way unstructured comparisons......Page 146
Is the analysis valid?......Page 149
Microarray data – severe multiplicity......Page 150
4.4.3 Data with a two-way structure, that is, two factors......Page 151
4.5 Response curves......Page 152
4.6 Data with a nested variation structure......Page 153
4.6.1 Degrees of freedom considerations......Page 154
4.7.1 The one-sample permutation test......Page 155
4.7.2 The two-sample permutation test......Page 156
4.7.3 Estimating the standard error of the median: bootstrapping......Page 157
The median......Page 159
4.8 Theories of inference......Page 160
4.8.1 Maximum likelihood estimation......Page 161
4.8.3 If there is strong prior information, use it!......Page 162
Dos and don’ts......Page 163
4.10.1 References for further reading......Page 164
4.11 Exercises......Page 165
5.1 Fitting a line to data......Page 170
5.1.1 Lawn roller example......Page 171
5.1.2 Calculating fitted values and residuals......Page 172
5.1.3 Residual plots......Page 173
5.1.4 Iron slag example: is there a pattern in the residuals?......Page 174
5.1.5 The analysis of variance table......Page 176
5.2 Outliers, influence and robust regression......Page 177
5.3.1 Confidence intervals and tests for the slope......Page 179
5.3.2 SEs and confidence intervals for predicted values......Page 180
5.3.3 Implications for design......Page 181
5.4.1 Issues of power......Page 183
5.5.1 Training/test sets and cross-validation......Page 184
5.5.2 Cross-validation – an example......Page 185
5.5.3 Bootstrapping......Page 187
Commentary......Page 189
5.6.1 General power transformations......Page 190
5.7 Size and shape data......Page 191
5.7.1 Allometric growth......Page 192
An alternative to a regression line......Page 193
5.8 The model matrix in regression......Page 194
5.9 Recap......Page 195
5.11 Exercises......Page 196
6.1 Basic ideas: book weight and brain weight examples......Page 199
6.1.2 Diagnostic plots......Page 202
6.1.3 Example: brain weight......Page 204
6.1.4 Plots that show the contribution of individual terms......Page 206
6.2 Multiple regression assumptions and diagnostics......Page 208
Leverage and the hat matrix......Page 209
6.2.2 Influence on the regression coefficients......Page 210
6.2.5 The uses of model diagnostics......Page 211
6.3.1 Preliminaries......Page 212
6.3.3 An example – the Scottish hill race data......Page 213
Inclusion of an interaction term......Page 216
The model without the interaction term......Page 218
6.4.1 R2 and adjusted R2......Page 219
6.4.3 How accurately does the equation predict?......Page 220
6.5.1 Book dimensions and book weight......Page 222
6.6 Problems with many explanatory variables......Page 225
Variable selection – a simulation with random data......Page 226
6.7.1 A contrived example......Page 228
An analysis that makes modest sense......Page 231
6.7.3 Remedies for multicollinearity......Page 232
Measurement of dietary intake......Page 233
A simulation of the effect of measurement error......Page 234
Errors in variables – multiple regression......Page 235
6.8.3 Missing explanatory variables......Page 236
6.8.5 Non-linear methods – an alternative to transformation?......Page 238
6.10 Further reading......Page 240
6.10.1 References for further reading......Page 241
6.11 Exercises......Page 242
7 Exploiting the linear model framework......Page 245
7.1.1 Example – sugar weight......Page 246
7.1.2 Different choices for the model matrix when there are factors......Page 249
7.2.1 Analysis of the rice data, allowing for block effects......Page 250
7.2.2 A balanced incomplete block design......Page 252
7.3 Fitting multiple lines......Page 253
7.4 Polynomial regression......Page 257
7.4.1 Issues in the choice of model......Page 259
7.5 Methods for passing smooth curves through data......Page 260
7.5.1 Scatterplot smoothing – regression splines......Page 261
7.5.3 Other smoothing methods......Page 265
Monotone curves......Page 266
7.6 Smoothing terms in additive models......Page 267
7.8 Exercises......Page 269
8.1.1 Transformation of the expected value on the left......Page 272
8.1.3 Log odds in contingency tables......Page 273
8.1.4 Logistic regression with a continuous explanatory variable......Page 274
8.2 Logistic multiple regression......Page 277
8.2.1 Selection of model terms and fitting the model......Page 279
8.2.2 A plot of contributions of explanatory variables......Page 282
8.2.3 Cross-validation estimates of predictive accuracy......Page 283
8.3 Logistic models for categorical data – an example......Page 284
8.4.1 Data on aberrant crypt foci......Page 286
8.4.2 Moth habitat example......Page 289
An unsatisfactory choice of reference level......Page 290
A more satisfactory choice of reference level......Page 292
The comparison between Bank and other habitats......Page 293
Diagnostic plots......Page 294
Quasi-binomial and quasi-Poisson models......Page 295
8.5.3 Leverage for binomial models......Page 296
Exploratory analysis......Page 297
Proportional odds logistic regression......Page 298
8.6.2 Loglinear models......Page 300
8.7 Survival analysis......Page 301
8.7.1 Analysis of the Aids2 data......Page 302
8.7.2 Right censoring prior to the termination of the study......Page 304
8.7.4 Hazard rates......Page 305
8.7.5 The Cox proportional hazards model......Page 306
8.8 Transformations for count data......Page 308
8.9.1 References for further reading......Page 309
8.10 Exercises......Page 310
9.1.1 Preliminary graphical explorations......Page 312
9.1.2 The autocorrelation function......Page 313
The AR(1) model......Page 314
The general AR(p) model......Page 315
9.1.4 Autoregressive moving average models – theory......Page 316
9.2 Regression modeling with moving average errors......Page 317
Other checks and computations......Page 322
9.3 Non-linear time series......Page 323
9.5 Further reading......Page 324
9.6 Exercises......Page 325
10 Multi-level models and repeated measures......Page 327
10.1 A one-way random effects model......Page 328
Interpreting the mean squares......Page 329
Details of the calculations......Page 330
Nested factors – a variety of applications......Page 331
Relations between variance components and mean squares......Page 332
Interpretation of variance components......Page 333
10.1.3 Analysis using lmer( )......Page 334
Fitted values and residuals in lmer( )......Page 335
Uncertainty in the parameter estimates......Page 336
10.2.1 Alternative models......Page 337
Ignoring the random structure in the data......Page 342
10.3 A multi-level experimental design......Page 343
10.3.1 The anova table......Page 345
10.3.2 Expected values of mean squares......Page 346
10.3.3 The sums of squares breakdown......Page 347
10.3.4 The variance components......Page 350
Plots of residuals......Page 351
10.3.7 Different sources of variance – complication or focus of interest?......Page 353
10.4 Within- and between-subject effects......Page 354
10.4.1 Model selection......Page 355
10.4.2 Estimates of model parameters......Page 356
10.5 Repeated measures in time......Page 358
Correlation structure......Page 359
10.5.1 Example – random variation between profiles......Page 360
A random coefficients model......Page 362
Preliminary data exploration......Page 365
A random coefficients model......Page 367
10.6.1 Predictions from models with a complex error structure......Page 369
10.7.1 An historical perspective on multi-level models......Page 370
10.8 Recap......Page 372
Multi-level models and repeated measures......Page 373
10.10 Exercises......Page 374
Note......Page 376
When are tree-based methods appropriate?......Page 377
11.2 Detecting email spam – an example......Page 378
11.3.1 Choosing the split – regression trees......Page 381
11.3.2 Within and between sums of squares......Page 382
11.3.3 Choosing the split – classification trees......Page 383
11.3.4 Tree-based regression versus loess regression smoothing......Page 384
11.4 Predictive accuracy and the cost–complexity tradeoff......Page 386
11.4.2 The cost–complexity parameter......Page 387
11.4.3 Prediction error versus tree size......Page 388
11.5 Data for female heart attack patients......Page 389
11.5.2 Printed information on each split......Page 391
11.6 Detecting email spam – the optimal tree......Page 392
How does the one-standard-error rule affect accuracy estimates?......Page 393
11.7 The randomForest package......Page 394
Comparison between rpart( ) and randomForest( )......Page 396
11.8.1 The combining of tree-based methods with other approaches......Page 397
11.8.6 Summary of pluses and minuses of tree-based methods......Page 398
11.9.1 References for further reading......Page 399
11.10 Exercises......Page 400
12 Multivariate data exploration and discrimination......Page 401
12.1.1 Scatterplot matrices......Page 402
12.1.2 Principal components analysis......Page 403
Preliminary data scrutiny......Page 405
The stability of the principal components plot......Page 407
Binary data......Page 409
12.2.1 Example – plant architecture......Page 410
Notation......Page 411
Predictive accuracy......Page 412
12.2.3 Linear discriminant analysis......Page 413
12.2.4 An example with more than two groups......Page 414
12.3 High-dimensional data, classification and plots......Page 416
What groups are of interest?......Page 417
12.3.1 Classifications and associated graphs......Page 418
12.3.2 Flawed graphs......Page 419
Distributional extremes......Page 421
12.3.3 Accuracies and scores for test data......Page 423
Cross-validation to determine the optimum number of features......Page 426
Which features?......Page 428
12.3.4 Graphs derived from the cross-validation process......Page 429
Further comments......Page 430
12.4 Further reading......Page 431
12.5 Exercises......Page 432
13.1 Principal component scores in regression......Page 434
The labor training data......Page 438
Potential sources of bias......Page 439
Data exploration......Page 440
13.2.1 Regression analysis, using all covariates......Page 441
13.2.2 The use of propensity scores......Page 443
13.3.1 References for further reading......Page 445
13.4 Exercises......Page 446
14.1.2 Workspace management......Page 447
Changing the working directory and/or workspace......Page 448
14.2 Data input and output......Page 449
The function read.table() and its variants......Page 450
Input of fixed format data......Page 451
Example – input of the Boston housing data......Page 452
Database connections......Page 453
Output to a file using cat( )......Page 454
Issues for the writing and use of functions......Page 455
The … argument......Page 456
14.3.3 Anonymous functions......Page 457
14.3.4 Functions for working with dates (and times)......Page 458
14.3.5 Creating groups......Page 459
14.4 Factors......Page 460
Factor contrasts......Page 461
Tests for main effects in the presence of interactions?......Page 462
Counting and identifying NAs – the use of table( )......Page 463
Sorting and ordering, where there are NAs......Page 464
14.6 Matrices and arrays......Page 465
14.6.1 Matrix arithmetic......Page 466
14.6.2 Outer products......Page 467
14.6.3 Arrays......Page 468
14.7.1 Lists – an extension of the notion of "vector"......Page 469
The dual identity of data frames......Page 470
14.7.3 Merging data frames – merge( )......Page 471
The tapply( ) function......Page 472
The functions lapply( ) and sapply( )......Page 473
14.7.7 Multivariate time series......Page 474
14.8.1 Printing and summarizing model objects......Page 475
14.8.3 S4 classes and methods......Page 476
14.9.1 Model and graphics formulae......Page 477
14.9.2 The use of a list to pass parameter values......Page 478
14.9.4 Environments......Page 479
Automatic naming of a file that holds function output......Page 480
Example – a function that identifies objects added during a session......Page 481
14.10 Document preparation – Sweave( )......Page 482
14.11.3 Plotting characters, symbols, line types and colors......Page 483
Colors......Page 485
Formatting and plotting of text and equations......Page 486
Symbolic substitution in parallel......Page 487
14.12 Lattice graphics and the grid package......Page 488
Annotation – the auto.key, key and legend arguments......Page 489
14.12.2 Use of grid.text() to label points......Page 490
14.12.3 Multiple lattice graphs on a graphics page......Page 491
14.13.2 References for further reading......Page 492
14.14 Exercises......Page 493
Epilogue – models......Page 496
Statistical models in genomics......Page 497
Other models yet!......Page 498
References for further reading......Page 499
Methodological references......Page 500
References for data sets......Page 505
References for packages......Page 508
Acknowledgments for use of data......Page 509
Index of R symbols and functions......Page 511
Index of terms......Page 517
Index of authors......Page 527