This book illustrates how data can be used to solve business problems. It explores analytics techniques for discovering hidden patterns and relationships in data, predicting future outcomes, optimizing efficiency, and improving the performance of organizations. You’ll learn how to analyze data by applying concepts from statistics, probability theory, and linear algebra. In this new edition, both R and Python are used to demonstrate these analyses. Practical Business Analytics Using R and Python also features new chapters covering databases, SQL, neural networks, text analytics, and natural language processing.
Part one begins with an introduction to analytics and the foundations required to perform data analytics, and explains analytics terms and concepts such as databases and SQL, basic statistics, probability theory, and data exploration. Part two introduces predictive models using statistical machine learning and discusses concepts such as regression, classification, and neural networks. Part three covers two of the most popular unsupervised learning techniques, clustering and association mining, as well as text mining and natural language processing (NLP). The book concludes with an overview of big data analytics and with R and Python essentials for analytics, including libraries such as pandas and NumPy.
Upon completing this book, you will understand how to improve business outcomes by leveraging R and Python for data analytics.
What You Will Learn
Master the mathematical foundations required for business analytics
Understand analytics models and data mining techniques, such as regression, supervised machine learning algorithms, and unsupervised modeling techniques, and learn how to choose the correct algorithm for a given task
Use R and Python to develop descriptive and predictive models and to optimize models
Interpret and recommend actions based on analytical model outcomes
Who This Book Is For
Software professionals, developers, managers, and executives who want to learn the fundamentals of analytics using R and Python.
Author(s): Umesh R. Hodeghatta, Ph.D.; Umesha Nayak
Edition: 2
Publisher: Apress
Year: 2023
Language: English
Pages: 711
Table of Contents
About the Authors
Preface
Foreword
Chapter 1: An Overview of Business Analytics
1.1 Introduction
1.2 Objectives of This Book
1.3 Confusing Terminology
1.4 Drivers for Business Analytics
1.4.1 Growth of Computer Packages and Applications
1.4.2 Feasibility to Consolidate Data from Various Sources
1.4.3 Growth of Infinite Storage and Computing Capability
1.4.4 Survival and Growth in the Highly Competitive World
1.4.5 Business Complexity Growing Out of Globalization
1.4.6 Easy-to-Use Programming Tools and Platforms
1.5 Applications of Business Analytics
1.5.1 Marketing and Sales
1.5.2 Human Resources
1.5.3 Product Design
1.5.4 Service Design
1.5.5 Customer Service and Support Areas
1.6 Skills Required for an Analytics Job
1.7 Process of an Analytics Project
1.8 Chapter Summary
Chapter 2: The Foundations of Business Analytics
2.1 Introduction
2.2 Population and Sample
2.2.1 Population
2.2.2 Sample
2.3 Statistical Parameters of Interest
2.3.1 Mean
2.3.2 Median
2.3.3 Mode
2.3.4 Range
2.3.5 Quantiles
2.3.6 Standard Deviation
2.3.7 Variance
2.3.8 Summary Command in R
2.4 Probability
2.4.1 Rules of Probability
2.4.1.1 Probability of Mutually Exclusive Events
2.4.1.2 Probability of Mutually Nonexclusive Events
2.4.1.3 Probability of Mutually Independent Events
2.4.1.4 The Probability of the Complement
2.4.2 Probability Distributions
2.4.2.1 Normal Distribution
2.4.2.2 Binomial Distribution
2.4.2.3 Poisson Distribution
2.4.3 Conditional Probability
2.5 Computations on Data Frames
2.6 Scatter Plot
2.7 Chapter Summary
Chapter 3: Structured Query Language Analytics
3.1 Introduction
3.2 Data Used by Us
3.3 Steps for Business Analytics
3.3.1 Initial Exploration and Understanding of the Data
3.3.2 Understanding Incorrect and Missing Data, and Correcting Such Data
3.3.3 Further Exploration and Reporting on the Data
3.3.3.1 Additional Examples of the Useful SELECT Statements
3.4 Chapter Summary
Chapter 4: Business Analytics Process
4.1 Business Analytics Life Cycle
4.1.1 Phase 1: Understand the Business Problem
4.1.2 Phase 2: Data Collection
4.1.2.1 Sampling
4.1.3 Phase 3: Data Preprocessing and Preparation
4.1.3.1 Data Types
4.1.3.2 Data Preparation
Handling Missing Values
Handling Duplicates, Junk, and Null Values
4.1.3.3 Data Transformation
Normalization
4.1.4 Phase 4: Explore and Visualize the Data
4.1.5 Phase 5: Choose Modeling Techniques and Algorithms
4.1.5.1 Descriptive Analytics
4.1.5.2 Predictive Analytics
4.1.5.3 Machine Learning
Supervised Machine Learning
Unsupervised Machine Learning
4.1.6 Phase 6: Evaluate the Model
4.1.7 Phase 7: Report to Management and Review
4.1.7.1 Problem Description
4.1.7.2 Data Set Used
4.1.7.3 Data Cleaning Steps Carried Out
4.1.7.4 Method Used to Create the Model
4.1.7.5 Model Deployment Prerequisites
4.1.7.6 Model Deployment and Usage
4.1.7.7 Handling Production Problems
4.1.8 Phase 8: Deploy the Model
4.2 Chapter Summary
Chapter 5: Exploratory Data Analysis
5.1 Exploring and Visualizing the Data
5.1.1 Tables
5.1.2 Describing Data: Summary Tables
5.1.3 Graphs
5.1.3.1 Histogram
5.1.3.2 Box Plots
Parts of Box Plots
Box Plots Using Python
5.1.3.3 Bivariate Analysis
5.1.3.4 Scatter Plots
5.1.4 Scatter Plot Matrices
5.1.4.1 Correlation Plot
5.1.4.2 Density Plots
5.2 Plotting Categorical Data
5.3 Chapter Summary
Chapter 6: Evaluating Analytics Model Performance
6.1 Introduction
6.2 Regression Model Evaluation
6.2.1 Root-Mean-Square Error
6.2.2 Mean Absolute Percentage Error
6.2.3 Mean Absolute Error (MAE) or Mean Absolute Deviation (MAD)
6.2.4 Sum of Squared Errors (SSE)
6.2.5 R2 (R-Squared)
6.2.6 Adjusted R2
6.3 Classification Model Evaluation
6.3.1 Classification Error Matrix
6.3.2 Sensitivity Analysis in Classification
6.4 ROC Chart
6.5 Overfitting and Underfitting
6.5.1 Bias and Variance
6.6 Cross-Validation
6.7 Measuring the Performance of Clustering
6.8 Chapter Summary
Chapter 7: Simple Linear Regression
7.1 Introduction
7.2 Correlation
7.2.1 Correlation Coefficient
7.3 Hypothesis Testing
7.4 Simple Linear Regression
7.4.1 Assumptions of Regression
7.4.2 Simple Linear Regression Equation
7.4.3 Creating a Simple Regression Equation in R
7.4.4 Testing the Assumptions of Regression
7.4.4.1 Test of Linearity
7.4.4.2 Test of Independence of Errors Around the Regression Line
7.4.4.3 Test of Normality
7.4.4.4 Equal Variance of the Distribution of the Response Variable
7.4.4.5 Other Ways of Validating the Assumptions to Be Fulfilled by a Regression Model
Using the gvlma Library
Using the Scale-Location Plot
Using the crPlots(model name) Function from library(car)
7.4.5 Conclusion
7.4.6 Predicting the Response Variable
7.4.7 Additional Notes
7.5 Using Python to Generate the Model and Validating the Assumptions
7.5.1 Load Important Packages and Import the Data
7.5.2 Generate a Simple Linear Regression Model
7.5.3 Alternative Way for Generation of the Model
7.5.4 Validation of the Significance of the Generated Model
7.5.5 Validating the Assumptions of Linear Regression
7.5.6 Predict Using the Model Generated
7.6 Chapter Summary
Chapter 8: Multiple Linear Regression
8.1 Using Multiple Linear Regression
8.1.1 The Data
8.1.2 Correlation
8.1.3 Arriving at the Model
8.1.4 Validation of the Assumptions of Regression
8.1.5 Multicollinearity
8.1.6 Stepwise Multiple Linear Regression
8.1.7 All Subsets Approach to Multiple Linear Regression
8.1.8 Multiple Linear Regression Equation
8.1.9 Conclusion
8.2 Using an Alternative Method in R
8.3 Predicting the Response Variable
8.4 Training and Testing the Model
8.5 Cross Validation
8.6 Using Python to Generate the Model and Validating the Assumptions
8.6.1 Load the Necessary Packages and Import the Data
8.6.2 Generate Multiple Linear Regression Model
8.6.3 Alternative Way to Generate the Model
8.6.4 Validating the Assumptions of Linear Regression
8.6.5 Predict Using the Model Generated
8.7 Chapter Summary
Chapter 9: Classification
9.1 What Are Classification and Prediction?
9.1.1 K-Nearest Neighbor
9.1.2 KNN Algorithm
9.1.3 KNN Using R
9.1.4 KNN Using Python
9.2 Naïve Bayes Models for Classification
9.2.1 Naïve Bayes Classifier Model Example
9.2.2 Naïve Bayes Classifier Using R (Use Same Data Set as KNN)
9.2.3 Advantages and Limitations of the Naïve Bayes Classifier
9.3 Decision Trees
9.3.1 Decision Tree Algorithm
9.3.1.1 Entropy
9.3.1.2 Information Gain
9.3.2 Building a Decision Tree
9.3.3 Classification Rules from Tree
9.3.3.1 Limiting Tree Growth and Pruning the Tree
9.4 Advantages and Disadvantages of Decision Trees
9.5 Ensemble Methods and Random Forests
9.6 Decision Tree Model Using R
9.7 Decision Tree Model Using Python
9.7.1 Creating the Decision Tree Model
9.7.2 Making Predictions
9.7.3 Measuring the Accuracy of the Model
9.7.4 Creating a Pruned Tree
9.8 Chapter Summary
Chapter 10: Neural Networks
10.1 What Is an Artificial Neural Network?
10.2 Concept and Structure of Neural Networks
10.2.1 Perceptrons
10.2.2 The Architecture of Neural Networks
10.3 Learning Algorithms
10.3.1 Predicting Attrition Using a Neural Network
10.3.2 Classification and Prediction Using a Neural Network
10.3.3 Training the Model
10.3.4 Backpropagation
10.4 Activation Functions
10.4.1 Linear Function
10.4.2 Sigmoid Activation Function
10.4.3 Tanh Function
10.4.4 ReLU Activation Function
10.4.5 Softmax Activation Function
10.4.6 Selecting an Activation Function
10.5 Practical Example of Predicting Using a Neural Network
10.5.1 Implementing a Neural Network Model Using R
10.5.1.1 Exploring Data
10.5.1.2 Preprocessing Data
10.5.1.3 Preparing the Train and Test Data
10.5.1.4 Creating a Neural Network Model Using the Neuralnet() Package
10.5.1.5 Predicting Test Data
10.5.1.6 Summary Report
10.5.1.7 Model Sensitivity Analysis and Performance
10.5.1.8 ROC and AUC
10.6 Implementation of a Neural Network Model Using Python
10.7 Strengths and Weaknesses of Neural Network Models
10.8 Deep Learning and Neural Networks
10.9 Chapter Summary
Chapter 11: Logistic Regression
11.1 Logistic Regression
11.1.1 The Data
11.1.2 Creating the Model
11.1.3 Model Fit Verification
11.1.4 General Words of Caution
11.1.5 Multicollinearity
11.1.6 Dispersion
11.1.7 Conclusion for Logistic Regression
11.2 Training and Testing the Model
11.2.1 Example of Prediction
11.2.2 Validating the Logistic Regression Model on Test Data
11.3 Multinomial Logistic Regression
11.4 Regularization
11.5 Using Python to Generate Logistic Regression
11.5.1 Loading the Required Packages and Importing the Data
11.5.2 Understanding the Dataframe
11.5.3 Getting the Data Ready for the Generation of the Logistic Regression Model
11.5.4 Splitting the Data into Training Data and Test Data
11.5.5 Generating the Logistic Regression Model
11.5.6 Predicting the Test Data
11.5.7 Fine-Tuning the Logistic Regression Model
11.5.8 Logistic Regression Model Using the statsmodel() Library
11.6 Chapter Summary
Chapter 12: Time Series: Forecasting
12.1 Introduction
12.2 Characteristics of Time-Series Data
12.3 Decomposition of a Time Series
12.4 Important Forecasting Models
12.4.1 Exponential Forecasting Models
12.4.2 ARMA and ARIMA Forecasting Models
12.4.3 Assumptions for ARMA and ARIMA
12.5 Forecasting in Python
12.5.1 Loading the Base Packages
12.5.2 Reading the Time-Series Data and Creating a Dataframe
12.5.3 Trying to Understand the Data in More Detail
12.5.4 Decomposition of the Time Series
12.5.5 Test Whether the Time Series Is “Stationary”
12.5.6 The Process of “Differencing”
12.5.7 Model Generation
12.5.8 ACF and PACF Plots to Check the Model Hyperparameters and the Residuals
12.5.9 Forecasting
12.6 Chapter Summary
Chapter 13: Cluster Analysis
13.1 Overview of Clustering
13.1.1 Distance Measure
13.1.2 Euclidean Distance
13.1.3 Manhattan Distance
13.1.4 Distance Measures for Categorical Variables
13.2 Distance Between Two Clusters
13.3 Types of Clustering
13.3.1 Hierarchical Clustering
13.3.2 Dendrograms
13.3.3 Nonhierarchical Method
13.3.4 K-Means Algorithm
13.3.5 Other Clustering Methods
13.3.6 Evaluating Clustering
13.4 Limitations of Clustering
13.5 Clustering Using R
13.5.1 Hierarchical Clustering Using R
13.6 Clustering Using Python sklearn()
13.7 Chapter Summary
Chapter 14: Relationship Data Mining
14.1 Introduction
14.2 Metrics to Measure Association: Support, Confidence, and Lift
14.2.1 Support
14.2.2 Confidence
14.2.3 Lift
14.3 Generating Association Rules
14.4 Association Rule (Market Basket Analysis) Using R
14.5 Association Rule (Market Basket Analysis) Using Python
14.6 Chapter Summary
Chapter 15: Introduction to Natural Language Processing
15.1 Overview
15.2 Applications of NLP
15.2.1 Chatbots
15.2.2 Sentiment Analysis
15.2.3 Machine Translation
15.3 What Is Language?
15.3.1 Phonemes
15.3.2 Lexeme
15.3.3 Morpheme
15.3.4 Syntax
15.3.5 Context
15.4 What Is Natural Language Processing?
15.4.1 Why Is NLP Challenging?
15.5 Approaches to NLP
15.5.1 WordNet Corpus
15.5.2 Brown Corpus
15.5.3 Reuters Corpus
15.5.4 Processing Text Using Regular Expressions
15.5.4.1 re.search() Method
15.5.4.2 re.findall()
15.5.4.3 re.sub()
15.6 Important NLP Python Libraries
15.7 Important NLP R Libraries
15.8 NLP Tasks Using Python
15.8.1 Text Normalization
15.8.2 Tokenization
15.8.3 Lemmatization
15.8.4 Stemming
15.8.5 Stop Word Removal
15.8.6 Part-of-Speech Tagging
15.8.7 Probabilistic Language Model
15.8.8 N-gram Language Model
15.9 Representing Words as Vectors
15.9.1 Bag-of-Words Modeling
15.9.2 TF-IDF Vectors
15.9.3 Term Frequency
15.9.4 Inverse Document Frequency
15.9.5 TF-IDF
15.10 Text Classifications
15.11 Word2vec Models
15.12 Text Analytics and NLP
15.13 Deep Learning and NLP
15.14 Case Study: Building a Chatbot
15.15 Chapter Summary
Chapter 16: Big Data Analytics and Future Trends
16.1 Introduction
16.2 Big Data Ecosystem
16.3 Future Trends in Big Data Analytics
16.3.1 Growth of Social Media
16.3.2 Creation of Data Lakes
16.3.3 Visualization Tools at the Hands of Business Users
16.3.4 Prescriptive Analytics
16.3.5 Internet of Things
16.3.6 Artificial Intelligence
16.3.7 Whole Data Processing
16.3.8 Vertical and Horizontal Applications
16.3.9 Real-Time Analytics
16.4 Putting the Analytics in the Hands of Business Users
16.5 Migration of Solutions from One Tool to Another
16.6 Cloud Analytics
16.7 In-Database Analytics
16.8 In-Memory Analytics
16.9 Autonomous Services for Machine Learning
16.10 Addressing Security and Compliance
16.11 Big Data Applications
16.12 Chapter Summary
Chapter 17: R for Analytics
17.1 Data Analytics Tools
17.2 Data Wrangling and Data Preprocessing Using R
17.2.1 Handling NAs and NULL Values in the Data Set
17.2.2 Apply() Functions in R
17.2.3 lapply()
17.2.4 sapply()
17.3 Removing Duplicate Records in the Data Set
17.4 split()
17.5 Writing Your Own Functions in R
17.6 Chapter Summary
Chapter 18: Python Programming for Analytics
18.1 Introduction
18.2 pandas for Data Analytics
18.2.1 Data Slicing Using pandas
18.2.2 Statistical Data Analysis Using pandas
18.2.3 pandas Database Functions
18.2.4 Data Preprocessing Using pandas
18.2.5 Handling Data Types
18.2.6 Handling Dates Variables
18.2.7 Feature Engineering
18.2.8 Data Preprocessing Using the apply() Function
18.2.9 Plots Using pandas
18.3 NumPy for Data Analytics
18.3.1 Creating NumPy Arrays with Zeros and Ones
18.3.2 Random Number Generation and Statistical Analysis
18.3.3 Indexing, Slicing, and Iterating
18.3.4 Stacking Two Arrays
18.4 Chapter Summary
References
Dataset Citation
Index