Learn Data Mining by Doing Data Mining

Data mining can be revolutionary, but only when it's done right. The powerful black-box data mining software now available can produce disastrously misleading results unless applied by a skilled and knowledgeable analyst. Discovering Knowledge in Data: An Introduction to Data Mining provides both the practical experience and the theoretical insight needed to reveal valuable information hidden in large data sets.

Employing a "white box" methodology and real-world case studies, this step-by-step guide walks readers through the various algorithms and statistical structures that underlie the software and presents examples of their operation on actual large data sets. Principal topics include:

* Data preprocessing and classification
* Exploratory analysis
* Decision trees
* Neural and Kohonen networks
* Hierarchical and k-means clustering
* Association rules
* Model evaluation techniques

Complete with scores of screenshots and diagrams to encourage graphical learning, Discovering Knowledge in Data: An Introduction to Data Mining gives students in Business, Computer Science, and Statistics, as well as professionals in the field, the power to turn any data warehouse into actionable knowledge. An Instructor's Manual presenting detailed solutions to all the problems in the book is available online.
Author(s): Larose D
Publisher: Wiley
Year: 2005
Language: English
Pages: 241
CONTENTS
PREFACE
1 INTRODUCTION TO DATA MINING
What Is Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP–DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
Fallacies of Data Mining
Description
Estimation
Prediction
Classification
Clustering
Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min–Max Normalization
Z-Score Standardization
Numerical Methods for Identifying Outliers
Exercises
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Binning
Summary
Exercises
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
Exercises
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias–Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Exercises
6 DECISION TREES
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
Exercises
7 NEURAL NETWORKS
Input and Output Encoding
Simple Example of a Neural Network
Sigmoid Activation Function
Gradient Descent Method
Back-Propagation Rules
Example of Back-Propagation
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
Exercises
Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
Example of k-Means Clustering at Work
Application of k-Means Clustering Using SAS Enterprise Miner
References
Exercises
Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Application of Clustering Using Kohonen Networks
Interpreting the Clusters
Cluster Profiles
Using Cluster Membership as Input to Downstream Data Mining Models
Exercises
Affinity Analysis and Market Basket Analysis
Data Representation for Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension from Flag Data to General Categorical Data
J-Measure
Application of Generalized Rule Induction
When Not to Use Association Rules
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns versus Global Models
Exercises
11 MODEL EVALUATION TECHNIQUES
Model Evaluation Techniques for the Estimation and Prediction Tasks
Error Rate, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
Exercises
EPILOGUE: "WE'VE ONLY JUST BEGUN"
INDEX