This book focuses on three core knowledge requirements for effective, thorough data analysis aimed at solving business problems. These are a foundational understanding of:
1. statistical, econometric, and machine learning techniques;
2. data handling capabilities;
3. at least one programming language.
Practical in orientation, the volume offers illustrative case studies throughout, with examples written in Python and presented in Jupyter notebooks. Topics covered include demand measurement and forecasting, predictive modeling, pricing analytics, customer satisfaction assessment, market and advertising research, and new product development and research. This volume will be useful to business data analysts, data scientists, and market research professionals, as well as aspiring practitioners in business data analytics. It can also be used in colleges and universities offering courses and certifications in business data analytics, data science, and market research.
Author(s): Walter R. Paczkowski
Publisher: Springer
Year: 2022
Language: English
Pages: 425
City: Cham
Preface
The Book's Focus
The Target Audience
The Book's Competitive Comparison
The Book's Structure
Acknowledgments
Contents
List of Figures
List of Tables
Part I Beginning Analytics
1 Introduction to Business Data Analytics: Setting the Stage
1.1 Types of Business Problems
1.2 The Role of Information in Business Decision Making
1.3 Uncertainty vs. Risk
1.4 The Data-Information Nexus
1.4.1 Data and Information Confusion
1.4.2 The Data Component
1.4.3 The Extractor Component
1.4.3.1 Text Data
1.4.3.2 Numeric Data
1.4.3.3 Data: A Combined View
1.4.4 The Information Component
1.5 Analytics Requirements
1.5.1 Theoretical Framework
1.5.2 Data Handling
1.5.3 Programming Literacy
1.5.4 Component Interconnections
2 Data Sources, Organization, and Structures
2.1 Data Dimensions: A Taxonomy for Defining Data
2.1.1 Taxonomy Component #1: Source
2.1.2 Taxonomy Component #2: Domain
2.1.3 Taxonomy Component #3: Levels
2.1.4 Taxonomy Component #4: Continuity
2.1.5 Taxonomy Component #5: Measurement Scale
2.2 Data Organization
2.2.1 External Database Structures
2.2.2 Internal Database Structures
2.3 Data Dictionary
3 Basic Data Handling
3.1 Case Studies
3.1.1 Case Study 1: Customer Transactions Data
3.1.2 Case Study 2: Measures of Order Fulfillment
3.2 Importing Your Data
3.2.1 Data Formats
3.2.2 Importing a CSV Text File into Pandas
3.2.3 Importing Large Files in Chunks
3.2.4 Checking Your Imported Data
3.2.4.1 Check #1: Display the First Few Records
3.2.4.2 Check #2: Check the Shape of the DataFrame
3.2.4.3 Check #3: Check Column Names
3.2.4.4 Check #4: Check for Missing Values
3.2.4.5 Check #5: Check the Data Types
3.3 Merging or Joining DataFrames
3.4 Reshaping DataFrames
3.5 Sorting a DataFrame
3.6 Querying a DataFrame
3.6.1 Boolean Operators and Indicator Functions
3.6.2 Pandas Query Method
4 Data Visualization: The Basics
4.1 Background for Data Visualization
4.2 Gestalt Principles of Visual Design
4.3 Issues Complicating Data Visualization
4.3.1 Human Visual Limitations
4.3.2 Data Visualization Tools
4.3.3 Types of Visuals
4.3.4 What to Look for in a Graph
4.3.4.1 Feature #1: Distributions
4.3.4.2 Feature #2: Relationships
4.3.4.3 Feature #3: Patterns
4.3.4.4 Feature #4: Trends
4.3.4.5 Feature #5: Anomalies
4.4 Visualizing Spatial Data
4.4.1 Data Preparation
4.4.2 Visualizing Continuous Spatial Data
4.4.3 Visualizing Categorical Spatial Data
4.4.4 Visualizing Continuous and Categorical Spatial Data
4.5 Visualizing Temporal (Time Series) Data
4.5.1 Properties of Temporal (Time Series) Data
4.5.2 Visualizing Time Series Data
4.5.3 Time Series Complications
4.6 Faceted Plots
4.7 Appendix
4.7.1 Taylor Series Expansion for Growth Rates
5 Advanced Data Handling: Preprocessing Methods
5.1 Transformations
5.1.1 Linear Transformations
5.1.2 Nonlinear Transformations
5.1.3 A Family of Transformations
5.2 Encoding
5.2.1 Dummy or One-Hot Encoding
5.2.1.1 Pandas Dummy Encoding
5.2.1.2 sklearn Dummy Encoding
5.2.2 Patsy Encoding
5.2.3 Label Encoding
5.2.4 Binarizing Data
5.3 Dimension Reduction
5.4 Handling Missing Data
5.5 Appendix
5.5.1 Mean and Variance of Standardized Variable
5.5.2 Mean and Variance of Adjusted Standardized Variable
5.5.3 Unbiased Estimators of μ and σ²
Part II Intermediate Analytics
6 OLS Regression: The Basics
6.1 Basic OLS Concept
6.1.1 The Disturbance Term and the Residual
6.1.2 OLS Estimation
6.1.3 The Gauss-Markov Theorem
6.2 Analysis of Variance
6.3 Case Study
6.3.1 Basic OLS Regression
6.3.2 The Log-Log Model
6.3.3 Model Set-up
6.3.4 Estimation Summary
6.3.5 ANOVA for Basic Regression
6.3.6 Elasticities
6.4 Basic Multiple Regression
6.4.1 ANOVA for Multiple Regression
6.4.2 Alternative Measures of Fit: AIC and BIC
6.5 Case Study: Expanded Analysis
6.6 Model Portfolio
6.7 Predictive Analysis: Introduction
6.7.1 Predicting vs. Forecasting
6.7.2 Developing a Prediction
6.7.3 Simulation Tool for Prediction Application
7 Time Series Analysis
7.1 Time Series Basics
7.1.1 Time Series Definition
7.1.2 Time Series Concepts
7.2 Importing a Date/Time Variable
7.3 The Data Cube and Time Series Data
7.4 Handling Dates and Times in Python and Pandas
7.4.1 Datetimes vs. Periods
7.4.2 Aggregating Datetime Measures
7.4.3 Converting Time Periods in Pandas
7.4.4 Date-Time Mini-Language
7.5 Some Calendrical Calculations
7.6 Time Series Generation Process: AR(1) Model
7.7 Visualization for AR(1) Detection
7.8 Durbin-Watson Test Statistic
7.9 Lagged Dependent and Independent Variables
7.9.1 Lagged Independent Variable: ARDL(0, 1)
7.9.2 Lagged Dependent Variable: ARDL(1, 0)
7.9.3 Lagged Dependent and Independent Variables: ARDL(1, 1)
7.10 Further Exploration of Time Series Analysis
7.10.1 Step 1: Identification of a Model
7.10.1.1 AR(p) Model
7.10.1.2 MA(q) Model
7.10.1.3 ARMA(p, q) Model
7.10.1.4 ARIMA(p, d, q) Model
7.10.1.5 Digression: Time Series Stationarity—An Overview
7.10.2 Step 2: Estimation of the Model
7.10.3 Step 3: Validation of the Model
7.10.4 Step 4: Forecasting with the Model
7.11 Appendix
7.11.1 Backshift Operator
7.11.2 Useful Algebra Results
7.11.3 Mean and Variance of Yt
7.11.4 Demeaned Data
7.11.5 Time Trend Addition
8 Statistical Tables
8.1 Data Preprocessing
8.2 Categorical Data
8.3 Creating a Frequency Table
8.4 Hypothesis Testing: A First Step
8.5 Cross-tabs and Hypothesis Tests
8.5.1 Hypothesis Testing
8.5.2 Plotting a Frequency Table
8.6 Extending the Cross-tab
8.7 Pivot Tables
8.8 Appendix
8.8.1 Pearson Chi-Square Statistic
Part III Advanced Analytics
9 Advanced Data Handling for Business Data Analytics
9.1 Supervised and Unsupervised Learning
9.2 Working with the Data Cube
9.3 The Data Cube and DataFrame Indexing
9.4 Sampling From a DataFrame
9.4.1 Simple Random Sampling (SRS)
9.4.2 Stratified Random Sampling
9.4.3 Cluster Random Sampling
9.5 Index Sorting of a DataFrame
9.6 Splitting a DataFrame: The Train-Test Splits
9.6.1 Model Tuning of Hyperparameters
9.6.2 Incorrect Use of Testing Data
9.6.3 Creating the Training/Testing Data Sets
9.6.3.1 Comment on Strategy
9.6.3.2 Handling Cross-Sectional Data
9.6.3.3 Handling Time Series Data
9.6.3.4 Handling Panel Data
9.6.4 Recombining the Data Sets
9.7 Appendix
9.7.1 Primer on Random Numbers
10 Advanced OLS for Business Data Analytics
10.1 Link Functions: An Introduction
10.2 Data Preprocessing
10.2.1 Data Standardization for Regression Analysis
10.2.2 One-Hot and Effects (or Sum) Encoding
10.3 Case Study Application
10.4 Heteroskedasticity Issues and Tests
10.4.1 Heteroskedasticity Problem
10.4.2 Heteroskedasticity Detection
10.4.3 Heteroskedasticity Remedy
10.5 Multicollinearity
10.5.1 Digression on Multicollinearity
10.5.2 Detection with VIF and the Condition Index
10.5.3 Principal Component Regression and High-Dimensional Data
10.6 Predictions and Scenario Analysis
10.6.1 Making Predictions
10.6.2 Scenario Analysis
10.6.3 Prediction Error Analysis (PEA)
10.6.3.1 LOOCV Approach
10.6.3.2 k-Fold Approach
10.6.3.3 Score Measures
10.6.3.4 Variations on Validation Methods
10.6.3.5 Complexity of Testing
10.6.3.6 Examples of k-Fold Split
10.7 Panel Data Models
11 Classification with Supervised Learning Methods
11.1 Case Study: Background
11.2 Logistic Regression
11.2.1 A Choice Interpretation
11.2.2 Properties of this Problem
11.2.3 A Model for the Binary Problem
11.2.4 Case Study: Train-Test Data Split
11.2.5 Case Study: Logit Model Training
11.2.6 Making and Assessing Predictions
11.2.7 Classification with a Logit Model
11.3 K-Nearest Neighbor (KNN)
11.3.1 Case Study: Predicting
11.4 Naive Bayes
11.4.1 Background: Bayes Theorem
11.4.2 A General Statement
11.4.3 The Naive Adjective: A Simplifying Assumption
11.4.4 Distribution Assumptions
11.4.5 Case Study: Naive Bayes Training
11.5 Decision Trees for Classification
11.5.1 Partitioning by Constants
11.5.2 Gini Index and Entropy
11.5.3 Case Study: Growing a Tree
11.5.4 Case Study: Predicting with a Tree
11.5.5 Random Forests
11.6 Support Vector Machines
11.6.1 Case Study: SVC Application
11.6.2 Case Study: Prediction
11.7 Classifier Accuracy Comparison
12 Grouping with Unsupervised Learning Methods
12.1 Training and Testing Data Sets
12.2 Hierarchical Clustering
12.2.1 Forms of Hierarchical Clustering
12.2.2 Agglomerative Algorithm Description
12.2.3 Metrics and Linkages
12.2.4 Preprocessing Data
12.2.5 Case Study Application
12.2.6 Examining More than One Solution
12.3 K-Means Clustering
12.3.1 Algorithm Description
12.3.2 Case Study Application
12.4 Mixture Model Clustering
Bibliography
Index