Data Science with R: A Step By Step Guide With Visual Illustrations and Examples

The Data Science field is expected to continue growing rapidly over the next several years, and Data Scientist is consistently rated as a top career. Data Science with R gives you the necessary theoretical background to start your Data Science journey and shows you how to apply the R programming language through practical examples in order to extract valuable knowledge from data. Professor Andrew Oleksy guides you through all the important concepts of data science, including the R programming language, Data Mining, Clustering, Classification and Prediction, the Hadoop framework, and more.

Author(s): Andrew Oleksy
Publisher: Andrew Oleksy
Year: 2018

Language: English
Pages: 201
Tags: Data Science, Data Visualization, R

Table of Contents......Page 3
Prerequisite Knowledge......Page 10
1.1 Data Science......Page 11
1.2.1 Data Collection......Page 14
1.2.5 Interpretation and Evaluation......Page 15
1.3 Model Types......Page 16
1.4 Examples and Counterexamples......Page 17
1.5.2 Regression......Page 18
1.5.3 Clustering......Page 19
1.5.4 Extraction and Association Analysis......Page 20
1.5.6 Anomaly Detection......Page 21
1.6.1 Medicine......Page 22
1.6.2 Finance......Page 23
1.6.3 Telecommunications......Page 24
1.7 Challenges......Page 26
1.8 The R Programming Language......Page 27
1.9 Basic Concepts, Definitions and Notations......Page 29
1.10 Tool Installation......Page 30
Prerequisite Knowledge......Page 33
Introduction to R......Page 34
2.1.1 Definition and Object Classes......Page 35
2.1.2 Vectors and Lists......Page 36
2.1.3 Matrix......Page 38
2.1.4 Factors and Nominal Data......Page 39
2.1.6 Data Frames......Page 40
2.2.2 Sequence creation......Page 42
2.2.3 Reference to Subsets......Page 43
2.2.4 Vectorization......Page 46
2.3.1 Loops: for, repeat and while......Page 47
2.3.3 Next and break statements......Page 49
2.4 Functions......Page 50
2.5 Scoping Rules......Page 52
2.6.2 sapply......Page 53
2.6.3 Split......Page 54
2.6.4 tapply......Page 55
2.7 Help from the console and Package Installation......Page 57
Prerequisite Knowledge......Page 58
Types, Quality and Data Preprocessing......Page 59
3.1 Categories and Types of Variables......Page 60
3.2.1.2 Data with Noise......Page 61
Example – Data smoothing using binning methods......Page 62
3.2.2 Data Unification......Page 63
3.2.3.1 Data Transformation......Page 64
Example – Data Regularization......Page 65
3.2.3.2 Data Discretization......Page 66
Example – Entropy-based discretization......Page 67
3.2.4.1 Dimension Reduction......Page 70
3.2.4.2 Data Compression......Page 71
3.3.1 dplyr......Page 74
3.3.2 tidyr......Page 78
Summary Statistics and Visualization......Page 83
4.1.2 Median......Page 84
4.2.1 Minimum value, Maximum value, Range......Page 86
4.2.2 Percentile values......Page 87
4.2.4 Variance......Page 88
4.2.5 Standard Deviation......Page 89
4.2.6 Coefficient of Variation......Page 90
4.3.1 Frequency Table......Page 91
4.3.3 Pie Chart......Page 92
4.3.4 Contingency Matrix......Page 93
4.3.5 Stacked Bar Charts and Grouped Bar Charts......Page 94
4.4.2 Histograms......Page 98
4.4.3 Frequency Polygon......Page 102
4.4.4 Boxplot......Page 103
Prerequisite Knowledge......Page 106
5.1.2.1 Description......Page 107
5.1.2.2 Decision Tree creation – ID3 Algorithm......Page 108
5.1.2.3 Decision Tree creation – Gini Index......Page 114
5.2.2.1 Description, Definitions and Notations......Page 118
5.2.2.3 Gradient Descent Algorithm......Page 119
5.2.2.4 Gradient Descent in Linear Regression......Page 121
5.2.2.5 Learning Parameter......Page 122
5.3.2 Model Regularization......Page 125
5.3.3 Linear Regression with Normalization......Page 126
Prerequisite Knowledge......Page 128
6.1 Unsupervised Learning......Page 129
6.2 Concept of Cluster......Page 130
6.3.2 Random Centroids Initialization......Page 131
6.3.3 Choosing the number of Clusters......Page 132
6.3.4 Applying k-means in R......Page 133
6.4.1 Distance Measurements Between Clusters......Page 136
6.4.4 Applying Hierarchical Clustering in R......Page 139
6.5.1 Basic Concepts......Page 142
6.5.2 Algorithm Description......Page 143
6.5.4 Advantages......Page 144
6.5.5 Disadvantages......Page 145
Mining of Frequent Itemsets and Association Rules......Page 147
7.1 Introduction......Page 148
7.2 Theoretical Background......Page 150
7.3 Apriori Algorithm......Page 152
7.4 Frequent Itemsets Types......Page 155
7.5 Positive and Negative Border of Frequent Itemsets......Page 156
7.6 Association Rules Mining......Page 157
7.7.1 Sampling Algorithm......Page 159
7.7.2 Partitioning Algorithm......Page 160
7.8 FP-Growth Algorithm......Page 161
7.9 Arules package......Page 165
Prerequisite Knowledge......Page 169
8.1 Introduction......Page 170
8.2 Advantages of Hadoop’s Distributed File System......Page 172
8.3 Hadoop Users......Page 174
8.4.2 HDFS Architecture......Page 175
8.4.3.3 Multiple Data Recording Nodes, Arbitrary File Modifications......Page 176
8.4.4.1 Blocks......Page 177
8.4.4.2 Namenodes and Datanodes......Page 178
8.4.4.3 HDFS Federation......Page 179
8.4.4.4 HDFS High Availability......Page 180
8.4.5 Data Flow – Data Reading......Page 182
8.4.6 Network Topology in Hadoop......Page 184
8.4.7 File Writing......Page 185
8.4.8 Copies Placement......Page 188
8.4.9 Consistency Model......Page 189
8.5 The Hadoop Cluster Architecture......Page 191
8.6 Hadoop Java API......Page 192
8.7.1 Generic Classes and Methods......Page 199
8.7.2 The Class Object......Page 200