Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas. Principles of Data Mining explains and explores the principal techniques of Data Mining: classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. It is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail. This second edition has been expanded to include additional chapters on using frequent pattern trees for Association Rule Mining, comparing classifiers, ensemble classification and dealing with very large volumes of data. Principles of Data Mining aims to help general readers develop the necessary understanding of what is inside the 'black box' so they can use commercial data mining packages discriminatingly, and to enable advanced readers and academic researchers to understand or contribute to future technical advances in the field. It is suitable as a textbook to support courses at undergraduate or postgraduate level in a wide range of subjects, including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science.
Table of Contents
Cover
Principles of Data Mining, Second Edition
ISBN 9781447148838 ISBN 9781447148845
About This Book
Note on the Second Edition
Contents
1 Introduction to Data Mining
1.1 The Data Explosion
1.2 Knowledge Discovery
1.3 Applications of Data Mining
1.4 Labelled and Unlabelled Data
1.5 Supervised Learning: Classification
1.6 Supervised Learning: Numerical Prediction
1.7 Unsupervised Learning: Association Rules
1.8 Unsupervised Learning: Clustering
2 Data for Data Mining
2.1 Standard Formulation
2.2 Types of Variable
2.2.1 Categorical and Continuous Attributes
2.3 Data Preparation
2.3.1 Data Cleaning
2.4 Missing Values
2.4.1 Discard Instances
2.4.2 Replace by Most Frequent/Average Value
2.5 Reducing the Number of Attributes
2.6 The UCI Repository of Datasets
2.7 Chapter Summary
2.8 Self-assessment Exercises for Chapter 2
3 Introduction to Classification: Naïve Bayes and Nearest Neighbour
3.1 What Is Classification?
3.2 Naïve Bayes Classifiers
3.3 Nearest Neighbour Classification
3.3.1 Distance Measures
3.3.2 Normalisation
3.3.3 Dealing with Categorical Attributes
3.4 Eager and Lazy Learning
3.5 Chapter Summary
3.6 Self-assessment Exercises for Chapter 3
4 Using Decision Trees for Classification
4.1 Decision Rules and Decision Trees
4.1.1 Decision Trees: The Golf Example
4.1.2 Terminology
4.1.3 The degrees Dataset
4.2 The TDIDT Algorithm
4.3 Types of Reasoning
4.4 Chapter Summary
4.5 Self-assessment Exercises for Chapter 4
5 Decision Tree Induction: Using Entropy for Attribute Selection
5.1 Attribute Selection: An Experiment
5.2 Alternative Decision Trees
5.2.1 The Football/Netball Example
5.2.2 The anonymous Dataset
5.3 Choosing Attributes to Split On: Using Entropy
5.3.1 The lens24 Dataset
5.3.2 Entropy
5.3.3 Using Entropy for Attribute Selection
5.3.4 Maximising Information Gain
5.4 Chapter Summary
5.5 Self-assessment Exercises for Chapter 5
6 Decision Tree Induction: Using Frequency Tables for Attribute Selection
6.1 Calculating Entropy in Practice
6.1.1 Proof of Equivalence
6.1.2 A Note on Zeros
6.2 Other Attribute Selection Criteria: Gini Index of Diversity
6.3 The χ² Attribute Selection Criterion
6.4 Inductive Bias
6.5 Using Gain Ratio for Attribute Selection
6.5.1 Properties of Split Information
6.5.2 Summary
6.6 Number of Rules Generated by Different Attribute Selection Criteria
6.7 Missing Branches
6.8 Chapter Summary
6.9 Self-assessment Exercises for Chapter 6
7 Estimating the Predictive Accuracy of a Classifier
7.1 Introduction
7.2 Method 1: Separate Training and Test Sets
7.2.1 Standard Error
7.2.2 Repeated Train and Test
7.3 Method 2: k-fold Cross-validation
7.4 Method 3: N-fold Cross-validation
7.5 Experimental Results I
7.6 Experimental Results II: Datasets with Missing Values
7.6.1 Strategy 1: Discard Instances
7.6.2 Strategy 2: Replace by Most Frequent/Average Value
7.6.3 Missing Classifications
7.7 Confusion Matrix
7.7.1 True and False Positives
7.8 Chapter Summary
7.9 Self-assessment Exercises for Chapter 7
8 Continuous Attributes
8.1 Introduction
8.2 Local versus Global Discretisation
8.3 Adding Local Discretisation to TDIDT
8.3.1 Calculating the Information Gain of a Set of Pseudo-attributes
8.3.2 Computational Efficiency
8.4 Using the ChiMerge Algorithm for Global Discretisation
8.4.1 Calculating the Expected Values and χ²
8.4.2 Finding the Threshold Value
8.4.3 Setting minIntervals and maxIntervals
8.4.4 The ChiMerge Algorithm: Summary
8.4.5 The ChiMerge Algorithm: Comments
8.5 Comparing Global and Local Discretisation for Tree Induction
8.6 Chapter Summary
8.7 Self-assessment Exercises for Chapter 8
9 Avoiding Overfitting of Decision Trees
9.1 Dealing with Clashes in a Training Set
9.1.1 Adapting TDIDT to Deal with Clashes
9.2 More About Overfitting Rules to Data
9.3 Pre-pruning Decision Trees
9.4 Post-pruning Decision Trees
9.5 Chapter Summary
9.6 Self-assessment Exercise for Chapter 9
10 More About Entropy
10.1 Introduction
10.2 Coding Information Using Bits
10.3 Discriminating Amongst M Values (M Not a Power of 2)
10.4 Encoding Values That Are Not Equally Likely
10.5 Entropy of a Training Set
10.6 Information Gain Must Be Positive or Zero
10.7 Using Information Gain for Feature Reduction for Classification Tasks
10.7.1 Example 1: The genetics Dataset
10.7.2 Example 2: The bcst96 Dataset
10.8 Chapter Summary
10.9 Self-assessment Exercises for Chapter 10
11 Inducing Modular Rules for Classification
11.1 Rule Post-pruning
11.2 Conflict Resolution
11.3 Problems with Decision Trees
11.4 The Prism Algorithm
11.4.1 Changes to the Basic Prism Algorithm
11.4.2 Comparing Prism with TDIDT
11.5 Chapter Summary
11.6 Self-assessment Exercise for Chapter 11
12 Measuring the Performance of a Classifier
12.1 True and False Positives and Negatives
12.2 Performance Measures
12.3 True and False Positive Rates versus Predictive Accuracy
12.4 ROC Graphs
12.5 ROC Curves
12.6 Finding the Best Classifier
12.7 Chapter Summary
12.8 Self-assessment Exercise for Chapter 12
13 Dealing with Large Volumes of Data
13.1 Introduction
13.2 Distributing Data onto Multiple Processors
13.3 Case Study: PMCRI
13.4 Evaluating the Effectiveness of a Distributed System: PMCRI
13.5 Revising a Classifier Incrementally
13.6 Chapter Summary
13.7 Self-assessment Exercises for Chapter 13
14 Ensemble Classification
14.1 Introduction
14.2 Estimating the Performance of a Classifier
14.3 Selecting a Different Training Set for Each Classifier
14.4 Selecting a Different Set of Attributes for Each Classifier
14.5 Combining Classifications: Alternative Voting Systems
14.6 Parallel Ensemble Classifiers
14.7 Chapter Summary
14.8 Self-assessment Exercises for Chapter 14
15 Comparing Classifiers
15.1 Introduction
15.2 The Paired t-Test
15.3 Choosing Datasets for Comparative Evaluation
15.3.1 Confidence Intervals
15.4 Sampling
15.5 How Bad Is a 'No Significant Difference' Result?
15.6 Chapter Summary
15.7 Self-assessment Exercises for Chapter 15
16 Association Rule Mining I
16.1 Introduction
16.2 Measures of Rule Interestingness
16.2.1 The Piatetsky-Shapiro Criteria and the RI Measure
16.2.2 Rule Interestingness Measures Applied to the
16.2.3 Using Rule Interestingness Measures for Conflict Resolution
16.3 Association Rule Mining Tasks
16.4 Finding the Best N Rules
16.4.1 The J-Measure: Measuring the Information Content of a Rule
16.4.2 Search Strategy
16.5 Chapter Summary
16.6 Self-assessment Exercises for Chapter 16
17 Association Rule Mining II
17.1 Introduction
17.2 Transactions and Itemsets
17.3 Support for an Itemset
17.4 Association Rules
17.5 Generating Association Rules
17.6 Apriori
17.7 Generating Supported Itemsets: An Example
17.8 Generating Rules for a Supported Itemset
17.9 Rule Interestingness Measures: Lift and Leverage
17.10 Chapter Summary
17.11 Self-assessment Exercises for Chapter 17
18 Association Rule Mining III: Frequent Pattern Trees
18.1 Introduction: FP-Growth
18.2 Constructing the FP-tree
18.2.1 Pre-processing the Transaction Database
18.2.2 Initialisation
18.2.3 Processing Transaction 1: f, c, a, m, p
18.2.4 Processing Transaction 2: f, c, a, b, m
18.2.5 Processing Transaction 3: f, b
18.2.6 Processing Transaction 4: c, b, p
18.2.7 Processing Transaction 5: f, c, a, m, p
18.3 Finding the Frequent Itemsets from the FP-tree
18.3.1 Itemsets Ending with Item p
18.3.2 Itemsets Ending with Item m
18.4 Chapter Summary
18.5 Self-assessment Exercises for Chapter 18
19 Clustering
19.1 Introduction
19.2 k-Means Clustering
19.2.1 Example
19.2.2 Finding the Best Set of Clusters
19.3 Agglomerative Hierarchical Clustering
19.3.1 Recording the Distance Between Clusters
19.3.2 Terminating the Clustering Process
19.4 Chapter Summary
19.5 Self-assessment Exercises for Chapter 19
20 Text Mining
20.1 Multiple Classifications
20.2 Representing Text Documents for Data Mining
20.3 Stop Words and Stemming
20.4 Using Information Gain for Feature Reduction
20.5 Representing Text Documents: Constructing a Vector Space Model
20.6 Normalising the Weights
20.7 Measuring the Distance Between Two Vectors
20.8 Measuring the Performance of a Text Classifier
20.9 Hypertext Categorisation
20.9.1 Classifying Web Pages
20.9.2 Hypertext Classification versus Text Classification
20.10 Chapter Summary
20.11 Self-assessment Exercises for Chapter 20
A Essential Mathematics
B Datasets
C Sources of Further Information
D Glossary and Notation
E Solutions to Self-assessment Exercises
Index
Author(s): Max Bramer
Series: Undergraduate Topics in Computer Science
Edition: 2nd ed. 2013
Publisher: Springer
Year: 2013
Language: English
Pages: 455