Practical Data Science With R

Practical Data Science with R, Second Edition takes a practice-oriented approach to explaining basic principles in the ever-expanding field of data science. You’ll jump right to real-world use cases as you apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support. Evidence-based decisions are crucial to success. Applying the right data analysis techniques to your carefully curated business data helps you make accurate predictions, identify trends, and spot trouble in advance. The R data analysis platform provides the tools you need to tackle day-to-day data analysis and machine learning tasks efficiently and effectively. Source code is available here: https://www.manning.com/downloads/2043

Author(s): Nina Zumel, John Mount (foreword by Jeremy Howard and Rachel Thomas)
Edition: 2nd Edition
Publisher: Manning Publications
Year: 2020

Language: English
Pages: 568
Tags: Software Design Tools, Artificial Intelligence, Mathematical & Statistical Software, Data Science, R: Programming Language

Practical Data Science with R
brief contents
contents
foreword
preface
acknowledgments
What is data science?
Roadmap
Audience
Code conventions and downloads
Working with this book
Book forum
about the authors
about the foreword authors
about the cover illustration
Part 1 Introduction to data science
1 The data science process
1.1.1 Project roles
1.2 Stages of a data science project
1.2.1 Defining the goal
1.2.2 Data collection and management
1.2.3 Modeling
1.2.4 Model evaluation and critique
1.2.5 Presentation and documentation
1.2.6 Model deployment and maintenance
1.3.1 Determining lower bounds on model performance
Summary
2 Starting with R and data
2.1 Starting with R
2.1.2 R programming
2.2.1 Working with well-structured data from files or URLs
2.2.2 Using R with less-structured data
2.3 Working with relational databases
2.3.1 A production-size example
Summary
3 Exploring data
3.1 Using summary statistics to spot problems
3.1.1 Typical problems revealed by data summaries
3.2 Spotting problems using graphics and visualization
3.2.1 Visually checking distributions for a single variable
3.2.2 Visually checking relationships between two variables
Summary
4.1 Cleaning data
4.1.1 Domain-specific data cleaning
4.1.2 Treating missing values
4.1.3 The vtreat package for automatically treating missing variables
4.2 Data transformations
4.2.1 Normalization
4.2.2 Centering and scaling
4.2.3 Log transformations for skewed and wide distributions
4.3 Sampling for modeling and validation
4.3.1 Test and training splits
4.3.2 Creating a sample group column
4.3.3 Record grouping
4.3.4 Data provenance
Summary
5 Data engineering and data shaping
5.1.1 Subsetting rows and columns
5.1.2 Removing records with incomplete data
5.1.3 Ordering rows
5.2.1 Adding new columns
5.2.2 Other simple operations
5.3.1 Combining many rows into summary rows
5.4.1 Combining two or more ordered data frames quickly
5.4.2 Principal methods to combine data from multiple tables
5.5.1 Moving data from wide to tall form
5.5.2 Moving data from tall to wide form
Summary
Part 2 Modeling methods
6 Choosing and evaluating models
6.1 Mapping problems to machine learning tasks
6.1.1 Classification problems
6.1.2 Scoring problems
6.1.3 Grouping: working without known targets
6.1.4 Problem-to-method mapping
6.2.1 Overfitting
6.2.2 Measures of model performance
6.2.3 Evaluating classification models
6.2.4 Evaluating scoring models
6.2.5 Evaluating probability models
6.3 Local interpretable model-agnostic explanations (LIME) for explaining model predictions
6.3.2 Walking through LIME: A small example
6.3.3 LIME for text classification
6.3.4 Training the text classifier
6.3.5 Explaining the classifier’s predictions
Summary
7 Linear and logistic regression
7.1 Using linear regression
7.1.1 Understanding linear regression
Equation 7.1 The expression for a linear regression model
7.1.2 Building a linear regression model
7.1.3 Making predictions
7.1.4 Finding relations and extracting advice
7.1.5 Reading the model summary and characterizing coefficient quality
7.2.1 Understanding logistic regression
Equation 7.2 The expression for a logistic regression model
7.2.2 Building a logistic regression model
7.2.3 Making predictions
7.2.4 Finding relations and extracting advice from logistic models
7.2.5 Reading the model summary and characterizing coefficients
7.2.6 Logistic regression takeaways
7.3.1 An example of quasi-separation
7.3.2 The types of regularized regression
7.3.3 Regularized regression with glmnet
Summary
8 Advanced data preparation
8.1 The purpose of the vtreat package
8.2 KDD and KDD Cup 2009
8.2.1 Getting started with KDD Cup 2009 data
8.2.2 The bull-in-the-china-shop approach
8.3 Basic data preparation for classification
8.3.1 The variable score frame
8.3.2 Properly using the treatment plan
8.4.1 Using mkCrossFrameCExperiment()
8.4.2 Building a model
8.5 Preparing data for regression modeling
8.6.1 The vtreat phases
8.6.2 Missing values
8.6.3 Indicator variables
8.6.4 Impact coding
8.6.6 The cross-frame
Summary
9 Unsupervised methods
9.1 Cluster analysis
9.1.1 Distances
9.1.2 Preparing the data
9.1.3 Hierarchical clustering with hclust
9.1.4 The k-means algorithm
9.1.5 Assigning new points to clusters
9.2.1 Overview of association rules
9.2.2 The example problem
9.2.3 Mining association rules with the arules package
9.2.4 Association rule takeaways
Summary
10 Exploring advanced methods
10.1 Tree-based methods
10.1.1 A basic decision tree
10.1.2 Using bagging to improve prediction
10.1.3 Using random forests to further improve prediction
10.1.4 Gradient-boosted trees
10.2.1 Understanding GAMs
10.2.2 A one-dimensional regression example
10.2.3 Extracting the non-linear relationships
10.2.4 Using GAM on actual data
10.2.5 Using GAM for logistic regression
10.2.6 GAM takeaways
10.3 Solving “inseparable” problems using support vector machines
10.3.1 Using an SVM to solve a problem
10.3.2 Understanding support vector machines
10.3.3 Understanding kernel functions
10.3.4 Support vector machine and kernel methods takeaways
Summary
Part 3 Working in the real world
11 Documentation and deployment
11.1 Predicting buzz
11.2 Using R markdown to produce milestone documentation
11.2.1 What is R markdown?
11.2.2 knitr technical details
11.2.3 Using knitr to document the Buzz data and produce the model
11.3.1 Writing effective comments
11.3.2 Using version control to record history
11.3.3 Using version control to explore your project
11.3.4 Using version control to share work
11.4 Deploying models
11.4.1 Deploying demonstrations using Shiny
11.4.2 Deploying models as HTTP services
11.4.3 Deploying models by export
11.4.4 What to take away
Summary
12 Producing effective presentations
12.1 Presenting your results to the project sponsor
12.1.1 Summarizing the project’s goals
12.1.2 Stating the project’s results
12.1.3 Filling in the details
12.1.5 Project sponsor presentation takeaways
12.2.1 Summarizing the project goals
12.2.2 Showing how the model fits user workflow
12.2.3 Showing how to use the model
12.3.1 Introducing the problem
12.3.2 Discussing related work
12.3.3 Discussing your approach
12.3.4 Discussing results and future work
Summary
A.1.1 Installing Tools
A.1.3 Installing Git
A.1.5 R resources
A.2 Starting with R
A.2.1 Primary features of R
A.2.2 Primary R data types
A.3.1 Running database queries using a query generator
A.3.2 How to think relationally about data
A.4 The takeaway
appendix B Important statistical concepts
B.1.1 Normal distribution
B.1.3 Lognormal distribution
B.1.4 Binomial distribution
B.2.1 Statistical philosophy
B.2.2 A/B tests
B.2.3 Power of tests
B.2.4 Specialized statistical tests
B.3.1 Sampling bias
B.3.2 Omitted variable bias
B.4 The takeaway
appendix C Bibliography
index