Software Implementation Illustrated with R and Python

About This Book
* Learn the nature of data through software, applying the preliminary concepts right away in R and Python.
* Understand data modeling and visualization to perform efficient statistical analysis with this guide.
* Get well versed in techniques such as regression, clustering, classification, support vector machines, and much more to learn the fundamentals of modern statistics.

Who This Book Is For
If you want a brief understanding of the nature of data and want to perform advanced statistical analysis using both R and Python, then this book is what you need. No prior knowledge is required. It suits aspiring data scientists, as well as R users trying to learn Python and vice versa.

What You Will Learn
* Learn the nature of data through software, applying the preliminary concepts right away in R
* Read data from various sources and export the R output to other software
* Perform effective data visualization suited to the nature of the variables, with rich alternative options
* Do exploratory data analysis for a useful first-sight understanding, building up to the right attitude towards effective inference
* Learn statistical inference through simulation, combining classical inference and modern computational power
* Delve deep into regression models such as linear and logistic for continuous and discrete regressands, which form the fundamentals of modern statistics
* Introduce yourself to CART, a machine learning tool which is very useful when the data has an intrinsic nonlinearity

In Detail
Statistical analysis involves collecting and examining data to describe the nature of the data that needs to be analyzed. It helps you explore relationships in the data and build models to make better decisions.

This book explores statistical concepts along with R and Python, which are well integrated from the word go. Almost every concept is accompanied by R code that exemplifies the strength of R and its applications. The R code and programs have been further strengthened with equivalent Python programs. Thus, you will first understand the data characteristics, descriptive statistics, and the exploratory attitude, which will give you a firm footing in data analysis. Statistical inference will complete the technical footing of statistical methods. Linear regression, logistic regression, and CART build the essential toolkit. This will help you solve complex problems in the real world.

You will begin with a brief understanding of the nature of data and end with modern and advanced statistical models like CART. Every step is taken with data and R code, and further enhanced by Python.

The data analysis journey begins with exploratory analysis, which is more than simple descriptive data summaries. You will then apply linear regression modeling, and end with logistic regression, CART, and spatial statistics.

By the end of this book you will be able to apply your statistical learning in major domains at work or in your projects.

Style and approach
* Develop better and smarter ways to analyze data.
* Make better decisions and future predictions.
* Learn how to explore, visualize, and perform statistical analysis.
* Use better and more efficient statistical and computational methods.
* Work through practical examples to master your learning.
Author(s): Prabhanjan Narayanachar Tattar
Edition: 2
Year: 2017
Language: English
Pages: 432
Cover
Copyright
Credits
About the Author
Acknowledgment
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Data Characteristics
Questionnaire and its components
Understanding the data characteristics in an R environment
Experiments with uncertainty in computer science
Installing and setting up R
Using R packages
RSADBE – the book's R package
Python installation and setup
Using pip for packages
IDEs for R and Python
The companion code bundle
Discrete distributions
Discrete uniform distribution
Binomial distribution
Hypergeometric distribution
Negative binomial distribution
Poisson distribution
Continuous distributions
Uniform distribution
Exponential distribution
Normal distribution
Summary
Chapter 2: Import/Export Data
Packages and settings – R and Python
Understanding data.frame and other formats
Constants, vectors, and matrices
Time for action – understanding constants, vectors, and basic arithmetic
What just happened?
Doing it in Python
Time for action – matrix computations
What just happened?
Doing it in Python
The list object
Time for action – creating a list object
What just happened?
The data.frame object
Time for action – creating a data.frame object
What just happened?
Have a go hero
The table object
Time for action – creating the Titanic dataset as a table object
What just happened?
Have a go hero
Using utils and the foreign packages
Time for action – importing data from external files
What just happened?
Doing it in Python
Importing data from MySQL
Doing it in Python
Exporting data/graphs
Exporting R objects
Exporting graphs
Time for action – exporting a graph
What just happened?
Managing R sessions
Time for action – session management
What just happened?
Doing it in Python
Pop quiz
Summary
Chapter 3: Data Visualization
Packages and settings – R and Python
Visualization techniques for categorical data
Bar chart
Going through the built-in examples of R
Time for action – bar charts in R
What just happened?
Doing it in Python
Have a go hero
Dot chart
Time for action – dot charts in R
What just happened?
Doing it in Python
Spine and mosaic plots
Time for action – spine plot for the shift and operator data
What just happened?
Time for action – mosaic plot for the Titanic dataset
What just happened?
Pie chart and the fourfold plot
Visualization techniques for continuous variable data
Boxplot
Time for action – using the boxplot
What just happened?
Doing it in Python
Histogram
Time for action – understanding the effectiveness of histograms
What just happened?
Doing it in Python
Have a go hero
Scatter plot
Time for action – plot and pairs R functions
What just happened?
Doing it in Python
Have a go hero
Pareto chart
A brief peek at ggplot2
Time for action – qplot
What just happened?
Time for action – ggplot
What just happened?
Pop quiz
Summary
Chapter 4: Exploratory Analysis
Packages and settings – R and Python
Essential summary statistics
Percentiles, quantiles, and median
Hinges
Interquartile range
Time for action – the essential summary statistics for The Wall dataset
What just happened?
Techniques for exploratory analysis
The stem-and-leaf plot
Time for action – the stem function in play
What just happened?
Letter values
Data re-expression
Have a go hero
Bagplot – a bivariate boxplot
Time for action – the bagplot display for multivariate datasets
What just happened?
Resistant line
Time for action – resistant line as a first regression model
What just happened?
Smoothing data
Time for action – smoothening the cow temperature data
What just happened?
Median polish
Time for action – the median polish algorithm
What just happened?
Have a go hero
Summary
Chapter 5: Statistical Inference
Packages and settings – R and Python
Maximum likelihood estimator
Visualizing the likelihood function
Time for action – visualizing the likelihood function
What just happened?
Doing it in Python
Finding the maximum likelihood estimator
Using the fitdistr function
Time for action – finding the MLE using mle and fitdistr functions
What just happened?
Confidence intervals
Time for action – confidence intervals
What just happened?
Doing it in Python
Hypothesis testing
Binomial test
Time for action – testing probability of success
What just happened?
Tests of proportions and the chi-square test
Time for action – testing proportions
What just happened?
Tests based on normal distribution – one sample
Time for action – testing one-sample hypotheses
What just happened?
Have a go hero
Tests based on normal distribution – two sample
Time for action – testing two-sample hypotheses
What just happened?
Have a go hero
Doing it in Python
Summary
Chapter 6: Linear Regression Analysis
Packages and settings – R and Python
The essence of regression
The simple linear regression model
What happens to the arbitrary choice of parameters?
Time for action – the arbitrary choice of parameters
What just happened?
Building a simple linear regression model
Time for action – building a simple linear regression model
What just happened?
Have a go hero
ANOVA and the confidence intervals
Time for action – ANOVA and the confidence intervals
What just happened?
Model validation
Time for action – residual plots for model validation
What just happened?
Doing it in Python
Have a go hero
Multiple linear regression model
Averaging k simple linear regression models or a multiple linear regression model
Time for action – averaging k simple linear regression models
What just happened?
Building a multiple linear regression model
Time for action – building a multiple linear regression model
What just happened?
The ANOVA and confidence intervals for the multiple linear regression model
Time for action – the ANOVA and confidence intervals for the multiple linear regression model
What just happened?
Have a go hero
Useful residual plots
Time for action – residual plots for the multiple linear regression model
What just happened?
Regression diagnostics
Leverage points
Influential points
DFFITS and DFBETAS
The multicollinearity problem
Time for action – addressing the multicollinearity problem for the gasoline data
What just happened?
Doing it in Python
Model selection
Stepwise procedures
The backward elimination
The forward selection
The stepwise regression
Criterion-based procedures
Time for action – model selection using the backward, forward, and AIC criteria
What just happened?
Have a go hero
Summary
Chapter 7: Logistic Regression Model
Packages and settings – R and Python
The binary regression problem
Time for action – limitation of linear regression model
What just happened?
Probit regression model
Time for action – understanding the constants
What just happened?
Doing it in Python
Logistic regression model
Time for action – fitting the logistic regression model
What just happened?
Doing it in Python
Hosmer-Lemeshow goodness-of-fit test statistic
Time for action – Hosmer-Lemeshow goodness-of-fit statistic
What just happened?
Model validation and diagnostics
Residual plots for the GLM
Time for action – residual plots for logistic regression model
What just happened?
Doing it in Python
Have a go hero
Influence and leverage for the GLM
Time for action – diagnostics for the logistic regression
What just happened?
Have a go hero
Receiver operating characteristic curves
Time for action – ROC construction
What just happened?
Doing it in Python
Logistic regression for the German credit screening dataset
Time for action – logistic regression for the German credit dataset
What just happened?
Doing it in Python
Have a go hero
Summary
Chapter 8: Regression Models with Regularization
Packages and settings – R and Python
The overfitting problem
Time for action – understanding overfitting
What just happened?
Doing it in Python
Have a go hero
Regression spline
Basis functions
Piecewise linear regression model
Time for action – fitting piecewise linear regression models
What just happened?
Natural cubic splines and the general B-splines
Time for action – fitting the spline regression models
What just happened?
Ridge regression for linear models
Protecting against overfitting
Time for action – ridge regression for the linear regression model
What just happened?
Doing it in Python
Ridge regression for logistic regression models
Time for action – ridge regression for the logistic regression model
What just happened?
Another look at model assessment
Time for action – selecting λ iteratively and other topics
What just happened?
Pop quiz
Summary
Chapter 9: Classification and Regression Trees
Packages and settings – R and Python
Understanding recursive partitions
Time for action – partitioning the display plot
What just happened?
Splitting the data
The first tree
Time for action – building our first tree
What just happened?
Constructing a regression tree
Time for action – the construction of a regression tree
What just happened?
Constructing a classification tree
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Classification tree for the German credit data
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Have a go hero
Pruning and other finer aspects of a tree
Time for action – pruning a classification tree
What just happened?
Pop quiz
Summary
Chapter 10: CART and Beyond
Packages and settings – R and Python
Improving the CART
Time for action – cross-validation predictions
What just happened?
Understanding bagging
The bootstrap
Time for action – understanding the bootstrap technique
What just happened?
How the bagging algorithm works
Time for action – the bagging algorithm
What just happened?
Doing it in Python
Random forests
Time for action – random forests for the German credit data
What just happened?
Doing it in Python
The consolidation
Time for action – random forests for the low birth weight data
What just happened?
Summary
Index