Data mining can be revolutionary, but only when it's done right. The powerful black-box data mining software now available can produce disastrously misleading results unless applied by a skilled and knowledgeable analyst. Discovering Knowledge in Data: An Introduction to Data Mining provides both the practical experience and the theoretical insight needed to reveal valuable information hidden in large data sets.

Employing a "white box" methodology and real-world case studies, this step-by-step guide walks readers through the various algorithms and statistical structures that underlie the software and presents examples of their operation on actual large data sets. Principal topics include:
* Data preprocessing and classification
* Exploratory analysis
* Decision trees
* Neural and Kohonen networks
* Hierarchical and k-means clustering
* Association rules
* Model evaluation techniques

Complete with scores of screenshots and diagrams to encourage graphical learning, Discovering Knowledge in Data: An Introduction to Data Mining gives students in Business, Computer Science, and Statistics, as well as professionals in the field, the power to turn any data warehouse into actionable knowledge.
Author(s): Daniel T. Larose
Edition: 1
Publisher: Wiley-Interscience
Year: 2004
Language: English
Pages: 237
Tags: Informatics and Computer Engineering; Artificial Intelligence; Data Mining
CONTENTS......Page 7
PREFACE......Page 11
1 INTRODUCTION TO DATA MINING......Page 16
What Is Data Mining?......Page 17
Need for Human Direction of Data Mining......Page 19
Cross-Industry Standard Process: CRISP–DM......Page 20
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action......Page 23
Fallacies of Data Mining......Page 25
Description......Page 26
Estimation......Page 27
Prediction......Page 28
Classification......Page 29
Clustering......Page 31
Association......Page 32
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks......Page 33
Case Study 3: Mining Association Rules from Legal Databases......Page 34
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees......Page 36
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis......Page 38
References......Page 39
Exercises......Page 40
Why Do We Need to Preprocess the Data?......Page 42
Data Cleaning......Page 43
Handling Missing Data......Page 45
Identifying Misclassifications......Page 48
Graphical Methods for Identifying Outliers......Page 49
Data Transformation......Page 50
Min–Max Normalization......Page 51
Z-Score Standardization......Page 52
Numerical Methods for Identifying Outliers......Page 53
Exercises......Page 54
Hypothesis Testing versus Exploratory Data Analysis......Page 56
Getting to Know the Data Set......Page 57
Dealing with Correlated Variables......Page 59
Exploring Categorical Variables......Page 60
Using EDA to Uncover Anomalous Fields......Page 65
Exploring Numerical Variables......Page 67
Exploring Multivariate Relationships......Page 74
Selecting Interesting Subsets of the Data for Further Investigation......Page 76
Binning......Page 77
Summary......Page 78
Exercises......Page 79
Data Mining Tasks in Discovering Knowledge in Data......Page 82
Statistical Approaches to Estimation and Prediction......Page 83
Univariate Methods: Measures of Center and Spread......Page 84
Statistical Inference......Page 86
Confidence Interval Estimation......Page 88
Bivariate Methods: Simple Linear Regression......Page 90
Dangers of Extrapolation......Page 94
Prediction Intervals for a Randomly Chosen Value of y Given x......Page 95
Multiple Regression......Page 98
Verifying Model Assumptions......Page 100
Exercises......Page 103
Supervised versus Unsupervised Methods......Page 105
Methodology for Supervised Modeling......Page 106
Bias–Variance Trade-Off......Page 108
Classification Task......Page 110
k-Nearest Neighbor Algorithm......Page 111
Distance Function......Page 114
Simple Unweighted Voting......Page 116
Weighted Voting......Page 117
Quantifying Attribute Relevance: Stretching the Axes......Page 118
k-Nearest Neighbor Algorithm for Estimation and Prediction......Page 119
Choosing k......Page 120
Exercises......Page 121
6 DECISION TREES......Page 122
Classification and Regression Trees......Page 124
C4.5 Algorithm......Page 131
Decision Rules......Page 136
Comparison of the C5.0 and CART Algorithms Applied to Real Data......Page 137
Exercises......Page 141
7 NEURAL NETWORKS......Page 143
Input and Output Encoding......Page 144
Simple Example of a Neural Network......Page 146
Sigmoid Activation Function......Page 149
Gradient Descent Method......Page 150
Back-Propagation Rules......Page 151
Example of Back-Propagation......Page 152
Learning Rate......Page 154
Momentum Term......Page 155
Sensitivity Analysis......Page 157
Application of Neural Network Modeling......Page 158
Exercises......Page 160
Clustering Task......Page 162
Hierarchical Clustering Methods......Page 164
Single-Linkage Clustering......Page 165
Complete-Linkage Clustering......Page 166
Example of k-Means Clustering at Work......Page 168
Application of k-Means Clustering Using SAS Enterprise Miner......Page 173
References......Page 176
Exercises......Page 177
Self-Organizing Maps......Page 178
Kohonen Networks......Page 180
Example of a Kohonen Network Study......Page 181
Application of Clustering Using Kohonen Networks......Page 185
Interpreting the Clusters......Page 186
Cluster Profiles......Page 190
Using Cluster Membership as Input to Downstream Data Mining Models......Page 192
Exercises......Page 193
Affinity Analysis and Market Basket Analysis......Page 195
Data Representation for Market Basket Analysis......Page 197
Support, Confidence, Frequent Itemsets, and the A Priori Property......Page 198
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets......Page 200
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules......Page 201
Extension from Flag Data to General Categorical Data......Page 204
J-Measure......Page 205
Application of Generalized Rule Induction......Page 206
When Not to Use Association Rules......Page 208
Do Association Rules Represent Supervised or Unsupervised Learning?......Page 211
Local Patterns versus Global Models......Page 212
Exercises......Page 213
11 MODEL EVALUATION TECHNIQUES......Page 215
Model Evaluation Techniques for the Estimation and Prediction Tasks......Page 216
Error Rate, False Positives, and False Negatives......Page 218
Misclassification Cost Adjustment to Reflect Real-World Concerns......Page 220
Decision Cost/Benefit Analysis......Page 222
Lift Charts and Gains Charts......Page 223
Interweaving Model Evaluation with Model Building......Page 226
Confluence of Results: Applying a Suite of Models......Page 227
Exercises......Page 228
EPILOGUE: "WE'VE ONLY JUST BEGUN"......Page 230
INDEX......Page 232