Most books on data mining focus on principles and furnish few instructions on how to carry out a data mining project. Data Mining Using SAS Applications not only introduces the key concepts but also enables readers to understand and successfully apply data mining methods using powerful yet user-friendly SAS macro-call files. These methods stress the use of visualization to thoroughly study the structure of data and check the validity of statistical models fitted to data.
• Learn how to convert PC databases to SAS data
• Discover sampling techniques to create training and validation samples
• Understand frequency data analysis for categorical data
• Explore supervised and unsupervised learning
• Master exploratory graphical techniques
• Acquire model validation techniques in regression and classification
The text furnishes 13 easy-to-use SAS data mining macros designed to work with the standard SAS modules. No additional modules or previous experience in SAS programming is required. The author shows how to perform complete predictive modeling, including data exploration, model fitting, assumption checks, validation, and scoring new data, on SAS datasets in less than ten minutes!
Author(s): George Fernandez
Series: Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Publisher: CRC
Year: 2002
Language: English
Pages: 361
Data Mining Using SAS Applications......Page 1
Why Use SAS Software?......Page 3
Coverage......Page 4
Key Features of the Book......Page 5
Additional Resources......Page 6
Acknowledgments......Page 8
Contents......Page 10
1.1 Introduction......Page 13
1.2.2 Price Drop in Data Storage and Efficient Computer Processing......Page 14
1.4 Data Mining: Users......Page 15
1.6.2 Data Processing......Page 17
1.6.3 Data Exploration and Descriptive Analysis......Page 18
1.6.5 Data Mining Solutions: Supervised Learning Methods......Page 19
1.7 Problems in the Data Mining Process......Page 21
1.8.1 SEMMA: The SAS Data Mining Process......Page 22
1.9 User-Friendly SAS Macros for Data Mining......Page 23
References......Page 24
Suggested Reading and Case Studies......Page 25
2.2 Data Requirements in Data Mining......Page 27
2.4 Understanding the Measurement Scale of Variables......Page 28
2.5 Entire Database vs. Representative Sample......Page 29
2.7 SAS Applications Used in Data Preparation......Page 30
2.7.1.2 Instructions for Creating SAS Dataset from Oracle Database Using SAS/ACCESS and the LIBNAME Statement......Page 31
2.7.2.1 Instructions for Converting PC Data Formats to SAS Datasets Using the SAS Import Wizard......Page 32
2.7.2.2 Converting PC Data Formats to SAS Datasets Using the EXCELSAS Macro......Page 34
2.7.2.3 Steps Involved in Running the EXCELSAS Macro......Page 35
2.7.2.4 Help File for SAS Macro EXCELSAS: Description of Macro Parameters......Page 36
2.7.2.5 Importing an Excel File Called “fraud” to a Permanent SAS Dataset Called “fraud”......Page 38
2.7.3 SAS Macro Applications: Random Sampling from the Entire Database Using the SAS Macro RANSPLIT......Page 39
2.7.3.1 Steps Involved in Running the RANSPLIT Macro......Page 45
2.7.3.2 Help File for SAS Macro RANSPLIT: Description of Macro Parameters......Page 46
2.7.3.3 Drawing TRAINING (400), VALIDATION (300), and TEST (All Leftover Observations) Samples from the Permanent SAS Dataset Called “fraud”......Page 49
2.8 Summary......Page 50
Suggested Reading......Page 51
3.2 Exploring Continuous Variables......Page 52
3.2.1.1 Measures of Location or Central Tendency......Page 53
3.2.1.4 Measures of Dispersion......Page 54
3.2.1.6 Detecting Deviation from Normally Distributed Data......Page 55
3.2.2 Graphical Techniques Used in EDA of Continuous Data......Page 56
3.3.2 Graphical Displays for Categorical Data......Page 59
3.4.1.1 Steps Involved in Running the FREQ Macro......Page 62
3.4.1.2 Help File for SAS Macro: FREQ, Description of Macro Parameters......Page 64
3.4.1.3 Case Study 1: Exploring Categorical Variables in a Permanent SAS Dataset gf.cars93......Page 67
3.4.2 EDA Analysis of Continuous Variables Using SAS Macro UNIVAR......Page 68
3.4.2.1 Steps Involved in Running the UNIVAR Macro......Page 70
3.4.2.2 Help File for SAS UNIVAR Macro: Description of Macro Parameters......Page 71
3.4.2.3 Case Study 2: Data Exploration and Continuous Variables......Page 74
3.4.2.4 Case Study 3: Exploring Continuous Data by a Group Variable......Page 79
References......Page 87
Suggested Reading......Page 88
4.1 Introduction......Page 89
4.2 Applications of Unsupervised Learning Methods......Page 90
4.3 Principal Component Analysis......Page 91
4.3.1 PCA Terminology......Page 92
4.4 Exploratory Factor Analysis......Page 94
4.4.1 Exploratory Factor Analysis vs. Principal Component Analysis......Page 95
4.4.2.3 Sampling Adequacy Check in Factor Analysis......Page 96
4.4.2.7 Scree/Parallel Analysis Plot......Page 97
4.4.2.10 Interpretability......Page 98
4.4.2.12 Factor Loadings......Page 99
4.4.2.13 Factor Rotation......Page 100
4.5.1.1 Hierarchical Cluster Analysis......Page 101
4.5.1.4 Optimum Number of Population Clusters......Page 102
4.6 Bi-Plot Display of PCA, EFA, and DCA Results......Page 103
4.7.1 Steps Involved in Running the FACTOR Macro......Page 104
4.7.2 Help File for SAS Macro FACTOR......Page 106
4.7.3.3 Exploratory Analysis......Page 110
4.7.4.1 Study Objectives......Page 118
4.7.4.3 Checking for Multivariate Normality Assumptions and Performing Maximum-Likelihood Factor Analysis......Page 119
4.7.4.4 Assessing the Appropriateness of Common Factor Analysis......Page 123
4.7.4.5 Determining the Number of Latent Factors......Page 125
4.7.4.6 Interpreting Common Factors......Page 127
4.7.4.7 Checking the Validity of Common Factor Analysis......Page 128
4.7.4.8 Investigating Interrelationships Between Multiple Attributes and Observations......Page 131
4.8 Disjoint Cluster Analysis Using SAS Macro DISJCLUS......Page 133
4.8.1 Steps Involved in Running the DISJCLUS Macro......Page 135
4.8.2 Help File for SAS Macro DISJCLUS......Page 136
4.8.3.3 Scatterplot Matrix of Cluster Separation, Variable Selection, and Optimum Cluster Number Estimation......Page 140
4.8.3.4 Scatterplot Matrix of Cluster Separation......Page 141
4.8.3.5 Significant Variable Selection......Page 143
4.8.3.6 Determining Optimum Number of Clusters......Page 144
4.8.3.7 Checking for Multivariate Normality......Page 146
4.8.3.10 Checking for Significant Cluster Groupings by CDA......Page 148
4.8.3.11 Bi-Plot Display of Canonical Discriminant Function Scores and the Cluster Groupings......Page 153
References......Page 158
Suggested Reading......Page 159
5.1 Introduction......Page 160
5.2 Applications of Supervised Predictive Methods......Page 161
5.3 Multiple Linear Regression Modeling......Page 162
5.3.1.2 Regression Parameter Estimates......Page 163
5.3.1.4 Significance of Regression Parameters......Page 164
5.3.1.6 Predicted and Residual Scores......Page 165
5.3.2 Exploratory Analysis Using Diagnostic Plots......Page 166
5.3.3 Model Selection......Page 168
5.3.4.2 Serial Correlation Among the Residual......Page 169
5.3.4.4 Multicollinearity......Page 170
5.3.4.6 Non-Normality of Residuals......Page 171
5.4 Binary Logistic Regression Modeling......Page 172
5.4.1 Terminology and Key Concepts......Page 173
5.4.1.2 Assessing the Model Fit......Page 175
5.4.2 Exploratory Analysis Using Diagnostic Plots......Page 176
5.4.2.1 Interpretation......Page 177
5.4.4.2 Influential Outlier......Page 178
5.4.4.4 Overdispersion......Page 179
5.5 Multiple Linear Regression Using SAS Macro REGDIAG......Page 180
5.5.2 Help File for SAS Macro REGDIAG......Page 181
5.6 Lift Chart Using SAS Macro LIFT......Page 186
5.6.2 Help File for Using SAS Macro LIFT......Page 187
5.7 Scoring New Regression Data Using the SAS Macro RSCORE......Page 191
5.7.1 Steps Involved in Running the RSCORE Macro......Page 192
5.7.2 Help File for Using SAS Macro RSCORE......Page 193
5.8.1 Steps Involved in Running the LOGISTIC Macro......Page 195
5.8.2 Help File for SAS Macro LOGISTIC......Page 196
5.9 Scoring New Logistic Regression Data Using the SAS Macro LSCORE......Page 200
5.9.1 Steps Involved in Running the LSCORE Macro......Page 201
5.9.2 Help File for Using SAS Macro LSCORE......Page 202
5.10.1 Study Objectives......Page 204
5.10.3 Exploratory Analysis/Diagnostic Plots......Page 205
5.10.5 Variable Selection Using R² Selection Method......Page 211
5.10.6 Checking for Model Specification Errors......Page 213
5.10.7 Regression Model Fitting......Page 216
5.10.8.1 Autocorrelation......Page 222
5.10.8.2 Significant Outlier/Influential Observations......Page 225
5.10.9 Performing If-Then Analysis and Producing the Lift Chart......Page 227
5.10.10 Predicting the Response Scores for a New Dataset......Page 228
5.11.1 Study Objectives......Page 229
5.11.2.1 Background Information......Page 231
5.11.2.2 Exploratory Analysis/Diagnostic Plots......Page 232
5.11.2.3 Fitting Regression Model and Validation......Page 235
5.11.2.4 Checking for Regression Model Violations......Page 240
5.11.2.5 Model Validation......Page 244
5.11.2.6 Performing If-Then Analysis and Producing the LIFT Chart......Page 245
5.12.1 Study Objectives......Page 247
5.12.2.1 Background Information......Page 249
5.12.2.2 Exploratory Analysis/Diagnostic Plots......Page 250
5.12.2.3 Fitting Binary Logistic Regression......Page 253
5.12.2.5 Model Validation......Page 261
5.12.2.6 Performing If-Then Analysis and Producing the LIFT Chart......Page 266
5.13 Summary......Page 267
References......Page 268
6.1 Introduction......Page 271
6.2 Discriminant Analysis......Page 272
6.3 Stepwise Discriminant Analysis......Page 273
6.4.1 Canonical Discriminant Analysis Assumptions......Page 274
6.4.2.1 Canonical Discriminant Function......Page 275
6.4.2.5 Bi-Plot Display of Canonical Discriminant Analysis......Page 276
6.5.1 Key Concepts and Terminology in Discriminant Function Analysis......Page 277
6.7 Classification Tree Based on CHAID......Page 280
6.7.1.1 Construction of Classification Trees......Page 281
6.7.1.2 Chi-Square Automatic Interaction Detection Method......Page 282
6.7.1.5 Assessing the Decision Trees......Page 283
6.9 Discriminant Analysis Using SAS Macro DISCRIM......Page 284
6.9.1 Steps Involved in Running the DISCRIM Macro......Page 285
6.9.2 Help File for SAS Macro DISCRIM......Page 286
6.10 Decision Tree Using SAS Macro CHAID......Page 291
6.10.2 Help File for SAS Macro CHAID......Page 292
6.11.1 Study Objectives......Page 295
6.11.2 Data Descriptions......Page 296
6.11.5 Variable Selection Methods......Page 297
6.11.6.1 Checking for Multivariate Normality......Page 300
6.11.7 Canonical Discriminant Analysis......Page 304
6.11.8 Discriminant Function Analysis......Page 313
6.12 Case Study 2: Nonparametric DFA......Page 320
6.12.1 Study Objectives......Page 321
6.12.2 Data Descriptions......Page 323
6.12.4 Data Exploration and Checking......Page 324
6.12.5 Discriminant Analysis and Checking for Multivariate Normality......Page 325
6.12.7 Checking for the Presence of Multivariate Outliers......Page 326
6.13 Case Study 3: Classification Tree Using CHAID......Page 338
6.13.1 Study Objectives......Page 339
6.13.2 Data Descriptions......Page 342
6.14 Summary......Page 347
References......Page 348
Suggested Reading......Page 349
7.1 Introduction......Page 350
7.2.1.1 Data Import......Page 351
7.2.1.4 Online Analytical Processing (OLAP)......Page 352
7.3 Artificial Neural Network Methods......Page 353
7.4 Market Basket Association Analysis......Page 354
7.4.1 Benefits of MBA......Page 355
7.6 Summary......Page 356
References......Page 357
Further Reading......Page 358
Internet Requirements......Page 359
Instructions for Running the SAS Macros......Page 360
Option 1: Downloadable Macros......Page 361