Principles of Data Mining

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

Data Mining, the automatic extraction of implicit and potentially useful information from data, is increasingly used in commercial, scientific and other application areas. Principles of Data Mining explains and explores the principal techniques of Data Mining: for classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed worked examples, with a focus on algorithms rather than mathematical formalism. It is written for readers without a strong background in mathematics or statistics, and any formulae used are explained in detail. This second edition has been expanded to include additional chapters on using frequent pattern trees for Association Rule Mining, comparing classifiers, ensemble classification and dealing with very large volumes of data. Principles of Data Mining aims to help general readers develop the necessary understanding of what is inside the 'black box' so they can use commercial data mining packages discriminatingly, as well as enabling advanced readers or academic researchers to understand or contribute to future technical advances in the field. Suitable as a textbook to support courses at undergraduate or postgraduate levels in a wide range of subjects including Computer Science, Business Studies, Marketing, Artificial Intelligence, Bioinformatics and Forensic Science. Table of Contents Cover Principles of Data Mining, Second Edition ISBN 9781447148838 ISBN 9781447148845 About This Book Note on the Second Edition Contents 1 Introduction to Data Mining 1.1 The Data Explosion 1.2 Knowledge Discovery 1.3 Applications of Data Mining 1.4 Labelled and Unlabelled Data 1.5 Supervised Learning: Classi.cation 1.6 Supervised Learning: Numerical Prediction 1.7 Unsupervised Learning: Association Rules 1.8 Unsupervised Learning: Clustering 2 Data for Data Mining 2.1 Standard Formulation 2.2 Types of Variable 2.2.1 Categorical and Continuous Attributes 2.3 Data Preparation 2.3.1 Data Cleaning 2.4 Missing Values 2.4.1 Discard Instances 2.4.2 Replace by Most Frequent/Average Value 2.5 Reducing the Number of Attributes 2.6 The UCI Repository of Datasets 2.7 Chapter Summary 2.8 Self-assessment Exercises for Chapter 2 3 Introduction to Classi.cation: Na�ive Bayes and Nearest Neighbour 3.1 What Is Classi.cation? 3.2 Na�ive Bayes Classi.ers 3.3 Nearest Neighbour Classi.cation 3.3.1 Distance Measures 3.3.2 Normalisation 3.3.3 Dealing with Categorical Attributes 3.4 Eager and Lazy Learning 3.5 Chapter Summary 3.6 Self-assessment Exercises for Chapter 3 4 Using Decision Trees for Classi.cation 4.1 Decision Rules and Decision Trees 4.1.1 Decision Trees: The Golf Example 4.1.2 Terminology 4.1.3 The degrees Dataset 4.2 The TDIDT Algorithm 4.3 Types of Reasoning 4.4 Chapter Summary 4.5 Self-assessment Exercises for Chapter 4 5 Decision Tree Induction: Using Entropy for Attribute Selection 5.1 Attribute Selection: An Experiment 5.2 Alternative Decision Trees 5.2.1 The Football/Netball Example 5.2.2 The anonymous Dataset 5.3 Choosing Attributes to Split On: Using Entropy 5.3.1 The lens24 Dataset 5.3.2 Entropy 5.3.3 Using Entropy for Attribute Selection 5.3.4 Maximising Information Gain 5.4 Chapter Summary 5.5 Self-assessment Exercises for Chapter 5 6 Decision Tree Induction: Using Frequency Tables for Attribute Selection 6.1 Calculating Entropy in Practice 6.1.1 Proof of Equivalence 6.1.2 A Note on Zeros 6.2 Other Attribute Selection Criteria: Gini Index of Diversity 6.3 The .2 Attribute Selection Criterion 6.4 Inductive Bias 6.5 Using Gain Ratio for Attribute Selection 6.5.1 Properties of Split Information 6.5.2 Summary 6.6 Number of Rules Generated by Di.erent Attribute Selection Criteria 6.7 Missing Branches 6.8 Chapter Summary 6.9 Self-assessment Exercises for Chapter 6 7 Estimating the Predictive Accuracy of a Classi.er 7.1 Introduction 7.2 Method 1: Separate Training and Test Sets 7.2.1 Standard Error 7.2.2 Repeated Train and Test 7.3 Method 2: k-fold Cross-validation 7.4 Method 3: N -fold Cross-validation 7.5 Experimental Results I 7.6 Experimental Results II: Datasets with Missing Values 7.6.1 Strategy 1: Discard Instances 7.6.2 Strategy 2: Replace by Most Frequent/Average Value 7.6.3 Missing Classi.cations 7.7 Confusion Matrix 7.7.1 True and False Positives 7.8 Chapter Summary 7.9 Self-assessment Exercises for Chapter 7 8 Continuous Attributes 8.1 Introduction 8.2 Local versus Global Discretisation 8.3 Adding Local Discretisation to TDIDT 8.3.1 Calculating the Information Gain of a Set of Pseudo-attributes 8.3.2 Computational E.ciency 8.4 Using the ChiMerge Algorithm for Global Discretisation 8.4.1 Calculating the Expected Values and .2 8.4.2 Finding the Threshold Value 8.4.3 Setting minIntervals and maxIntervals 8.4.4 The ChiMerge Algorithm: Summary 8.4.5 The ChiMerge Algorithm: Comments 8.5 Comparing Global and Local Discretisation for Tree Induction 8.6 Chapter Summary 8.7 Self-assessment Exercises for Chapter 8 9 Avoiding Over.tting of Decision Trees 9.1 Dealing with Clashes in a Training Set 9.1.1 Adapting TDIDT to Deal with Clashes 9.2 More About Over.tting Rules to Data 9.3 Pre-pruning Decision Trees 9.4 Post-pruning Decision Trees 9.5 Chapter Summary 9.6 Self-assessment Exercise for Chapter 9 10 More About Entropy 10.1 Introduction 10.2 Coding Information Using Bits 10.3 Discriminating Amongst M Values (M Not a Power of 2) 10.4 Encoding Values That Are Not Equally Likely 10.5 Entropy of a Training Set 10.6 Information Gain Must Be Positive or Zero 10.7 Using Information Gain for Feature Reduction for Classi.cation Tasks 10.7.1 Example 1: The genetics Dataset 10.7.2 Example 2: The bcst96 Dataset 10.8 Chapter Summary 10.9 Self-assessment Exercises for Chapter 10 11 Inducing Modular Rules for Classi.cation 11.1 Rule Post-pruning 11.2 Con.ict Resolution 11.3 Problems with Decision Trees 11.4 The Prism Algorithm 11.4.1 Changes to the Basic Prism Algorithm 11.4.2 Comparing Prism with TDIDT 11.5 Chapter Summary 11.6 Self-assessment Exercise for Chapter 11 12 Measuring the Performance of a Classi.er 12.1 True and False Positives and Negatives 12.2 Performance Measures 12.3 True and False Positive Rates versus Predictive Accuracy 12.4 ROC Graphs 12.5 ROC Curves 12.6 Finding the Best Classi.er 12.7 Chapter Summary 12.8 Self-assessment Exercise for Chapter 12 13 Dealing with Large Volumes of Data 13.1 Introduction 13.2 Distributing Data onto Multiple Processors 13.3 Case Study: PMCRI 13.4 Evaluating the E.ectiveness of a Distributed System: PMCRI 13.5 Revising a Classi.er Incrementally 13.6 Chapter Summary 13.7 Self-assessment Exercises for Chapter 13 14 Ensemble Classi.cation 14.1 Introduction 14.2 Estimating the Performance of a Classi.er 14.3 Selecting a Di.erent Training Set for Each Classi.er 14.4 Selecting a Di.erent Set of Attributes for Each Classi.er 14.5 Combining Classi.cations: Alternative Voting Systems 14.6 Parallel Ensemble Classi.ers 14.7 Chapter Summary 14.8 Self-assessment Exercises for Chapter 14 15 Comparing Classi.ers 15.1 Introduction 15.2 The Paired t-Test 15.3 Choosing Datasets for Comparative Evaluation 15.3.1 Con.dence Intervals 15.4 Sampling 15.5 How Bad Is a `No Signi.cant Di.erence' Result? 15.6 Chapter Summary 15.7 Self-assessment Exercises for Chapter 15 16 Association Rule Mining I 16.1 Introduction 16.2 Measures of Rule Interestingness 16.2.1 The Piatetsky-Shapiro Criteria and the RI Measure 16.2.2 Rule Interestingness Measures Applied to the 16.2.3 Using Rule Interestingness Measures for Con.ict Resolution 16.3 Association Rule Mining Tasks 16.4 Finding the Best N Rules 16.4.1 The J -Measure: Measuring the Information Content of a Rule 16.4.2 Search Strategy 16.5 Chapter Summary 16.6 Self-assessment Exercises for Chapter 16 17 Association Rule Mining II 17.1 Introduction 17.2 Transactions and Itemsets 17.3 Support for an Itemset 17.4 Association Rules 17.5 Generating Association Rules 17.6 Apriori 17.7 Generating Supported Itemsets: An Example 17.8 Generating Rules for a Supported Itemset 17.9 Rule Interestingness Measures: Lift and Leverage 17.10 Chapter Summary 17.11 Self-assessment Exercises for Chapter 17 18 Association Rule Mining III: Frequent Pattern Trees 18.1 Introduction: FP-Growth 18.2 Constructing the FP-tree 18.2.1 Pre-processing the Transaction Database 18.2.2 Initialisation 18.2.3 Processing Transaction 1: f, c, a, m, p 18.2.4 Processing Transaction 2: f, c, a, b, m 18.2.5 Processing Transaction 3: f, b 18.2.6 Processing Transaction 4: c, b, p 18.2.7 Processing Transaction 5: f, c, a, m, p 18.3 Finding the Frequent Itemsets from the FP-tree 18.3.1 Itemsets Ending with Item p 18.3.2 Itemsets Ending with Item m 18.4 Chapter Summary 18.5 Self-assessment Exercises for Chapter 18 19 Clustering 19.1 Introduction 19.2 k-Means Clustering 19.2.1 Example 19.2.2 Finding the Best Set of Clusters 19.3 Agglomerative Hierarchical Clustering 19.3.1 Recording the Distance Between Clusters 19.3.2 Terminating the Clustering Process 19.4 Chapter Summary 19.5 Self-assessment Exercises for Chapter 19 20 Text Mining 20.1 Multiple Classi.cations 20.2 Representing Text Documents for Data Mining 20.3 Stop Words and Stemming 20.4 Using Information Gain for Feature Reduction 20.5 Representing Text Documents: Constructing a Vector Space Model 20.6 Normalising the Weights 20.7 Measuring the Distance Between Two Vectors 20.8 Measuring the Performance of a Text Classi.er 20.9 Hypertext Categorisation 20.9.1 Classifying Web Pages 20.9.2 Hypertext Classi.cation versus Text Classi.cation 20.10 Chapter Summary 20.11 Self-assessment Exercises for Chapter 20 A Essential Mathematics B Datasets C Sources of Further Information D Glossary and Notation E Solutions to Self-assessment Exercises Index

Author(s): Max Bramer
Series: Undergraduate Topics in Computer Science
Edition: 2nd ed. 2013
Publisher: Springer
Year: 2013

Language: English
Pages: 455

Cover......Page 1
Principles of Data Mining, Second Edition......Page 4
ISBN 9781447148838 ISBN 9781447148845......Page 5
About This Book......Page 6
Note on the Second Edition......Page 7
Contents......Page 8
1.1 The Data Explosion......Page 16
1.2 Knowledge Discovery......Page 17
1.3 Applications of Data Mining......Page 18
1.4 Labelled and Unlabelled Data......Page 19
1.5 Supervised Learning: Classi.cation......Page 20
1.7 Unsupervised Learning: Association Rules......Page 22
1.8 Unsupervised Learning: Clustering......Page 23
2.1 Standard Formulation......Page 24
2.2 Types of Variable......Page 25
2.3 Data Preparation......Page 27
2.3.1 Data Cleaning......Page 28
2.4.2 Replace by Most Frequent/Average Value......Page 30
2.5 Reducing the Number of Attributes......Page 31
2.6 The UCI Repository of Datasets......Page 32
2.8 Self-assessment Exercises for Chapter 2......Page 33
3.1 What Is Classi.cation?......Page 36
3.2 Na¨ive Bayes Classi.ers......Page 37
3.3 Nearest Neighbour Classi.cation......Page 44
3.3.1 Distance Measures......Page 47
3.3.2 Normalisation......Page 50
3.4 Eager and Lazy Learning......Page 51
3.6 Self-assessment Exercises for Chapter 3......Page 52
4.1 Decision Rules and Decision Trees......Page 54
4.1.1 Decision Trees: The Golf Example......Page 55
4.1.2 Terminology......Page 56
4.1.3 The degrees Dataset......Page 57
4.2 The TDIDT Algorithm......Page 60
4.3 Types of Reasoning......Page 62
4.5 Self-assessment Exercises for Chapter 4......Page 63
5.1 Attribute Selection: An Experiment......Page 64
5.2 Alternative Decision Trees......Page 65
5.2.1 The Football/Netball Example......Page 66
5.2.2 The anonymous Dataset......Page 68
5.3 Choosing Attributes to Split On: Using Entropy......Page 69
5.3.1 The lens24 Dataset......Page 70
5.3.2 Entropy......Page 72
5.3.3 Using Entropy for Attribute Selection......Page 73
5.3.4 Maximising Information Gain......Page 75
5.5 Self-assessment Exercises for Chapter 5......Page 76
6.1 Calculating Entropy in Practice......Page 78
6.1.1 Proof of Equivalence......Page 79
6.2 Other Attribute Selection Criteria: Gini Index of Diversity......Page 81
6.3 The .2 Attribute Selection Criterion......Page 83
6.4 Inductive Bias......Page 86
6.5 Using Gain Ratio for Attribute Selection......Page 88
6.5.1 Properties of Split Information......Page 89
6.6 Number of Rules Generated by Di.erent Attribute Selection Criteria......Page 90
6.7 Missing Branches......Page 91
6.9 Self-assessment Exercises for Chapter 6......Page 92
7.1 Introduction......Page 94
7.2 Method 1: Separate Training and Test Sets......Page 95
7.2.1 Standard Error......Page 96
7.3 Method 2: k-fold Cross-validation......Page 97
7.4 Method 3: N -fold Cross-validation......Page 98
7.5 Experimental Results I......Page 99
7.6 Experimental Results II: Datasets with Missing Values......Page 101
7.6.2 Strategy 2: Replace by Most Frequent/Average Value......Page 102
7.7 Confusion Matrix......Page 104
7.7.1 True and False Positives......Page 105
7.9 Self-assessment Exercises for Chapter 7......Page 106
8.1 Introduction......Page 108
8.2 Local versus Global Discretisation......Page 110
8.3 Adding Local Discretisation to TDIDT......Page 111
8.3.1 Calculating the Information Gain of a Set of Pseudo-attributes......Page 112
8.3.2 Computational E.ciency......Page 117
8.4 Using the ChiMerge Algorithm for Global Discretisation......Page 120
8.4.1 Calculating the Expected Values and .2......Page 123
8.4.3 Setting minIntervals and maxIntervals......Page 128
8.4.5 The ChiMerge Algorithm: Comments......Page 130
8.5 Comparing Global and Local Discretisation for Tree Induction......Page 131
8.7 Self-assessment Exercises for Chapter 8......Page 133
9 Avoiding Over.tting of Decision Trees......Page 136
9.1.1 Adapting TDIDT to Deal with Clashes......Page 137
9.2 More About Over.tting Rules to Data......Page 142
9.3 Pre-pruning Decision Trees......Page 143
9.4 Post-pruning Decision Trees......Page 145
9.5 Chapter Summary......Page 150
9.6 Self-assessment Exercise for Chapter 9......Page 151
10.1 Introduction......Page 152
10.2 Coding Information Using Bits......Page 155
10.3 Discriminating Amongst M Values (M Not a Power of 2)......Page 157
10.4 Encoding Values That Are Not Equally Likely......Page 158
10.5 Entropy of a Training Set......Page 161
10.6 Information Gain Must Be Positive or Zero......Page 162
10.7 Using Information Gain for Feature Reduction for Classi.cation Tasks......Page 164
10.7.1 Example 1: The genetics Dataset......Page 165
10.7.2 Example 2: The bcst96 Dataset......Page 169
10.9 Self-assessment Exercises for Chapter 10......Page 171
11.1 Rule Post-pruning......Page 172
11.2 Con.ict Resolution......Page 174
11.3 Problems with Decision Trees......Page 177
11.4 The Prism Algorithm......Page 179
11.4.1 Changes to the Basic Prism Algorithm......Page 186
11.4.2 Comparing Prism with TDIDT......Page 187
11.6 Self-assessment Exercise for Chapter 11......Page 188
12 Measuring the Performance of a Classi.er......Page 190
12.1 True and False Positives and Negatives......Page 191
12.2 Performance Measures......Page 193
12.3 True and False Positive Rates versus Predictive Accuracy......Page 196
12.4 ROC Graphs......Page 197
12.5 ROC Curves......Page 199
12.6 Finding the Best Classi.er......Page 200
12.7 Chapter Summary......Page 201
12.8 Self-assessment Exercise for Chapter 12......Page 202
13.1 Introduction......Page 204
13.2 Distributing Data onto Multiple Processors......Page 207
13.3 Case Study: PMCRI......Page 209
13.4 Evaluating the E.ectiveness of a Distributed System: PMCRI......Page 212
13.5 Revising a Classi.er Incrementally......Page 216
13.7 Self-assessment Exercises for Chapter 13......Page 222
14.1 Introduction......Page 224
14.2 Estimating the Performance of a Classi.er......Page 227
14.3 Selecting a Di.erent Training Set for Each Classi.er......Page 228
14.4 Selecting a Di.erent Set of Attributes for Each Classi.er......Page 229
14.5 Combining Classi.cations: Alternative Voting Systems......Page 230
14.7 Chapter Summary......Page 234
14.8 Self-assessment Exercises for Chapter 14......Page 235
15.1 Introduction......Page 236
15.2 The Paired t-Test......Page 238
15.3 Choosing Datasets for Comparative Evaluation......Page 244
15.4 Sampling......Page 246
15.5 How Bad Is a ‘No Signi.cant Di.erence’ Result?......Page 249
15.7 Self-assessment Exercises for Chapter 15......Page 250
16.1 Introduction......Page 252
16.2 Measures of Rule Interestingness......Page 254
16.2.1 The Piatetsky-Shapiro Criteria and the RI Measure......Page 256
16.2.2 Rule Interestingness Measures Applied to the......Page 258
16.3 Association Rule Mining Tasks......Page 260
16.4 Finding the Best N Rules......Page 261
16.4.1 The J -Measure: Measuring the Information Content of a Rule......Page 262
16.4.2 Search Strategy......Page 263
16.6 Self-assessment Exercises for Chapter 16......Page 266
17.1 Introduction......Page 268
17.2 Transactions and Itemsets......Page 269
17.3 Support for an Itemset......Page 270
17.4 Association Rules......Page 271
17.5 Generating Association Rules......Page 273
17.6 Apriori......Page 274
17.7 Generating Supported Itemsets: An Example......Page 277
17.8 Generating Rules for a Supported Itemset......Page 279
17.9 Rule Interestingness Measures: Lift and Leverage......Page 281
17.10 Chapter Summary......Page 283
17.11 Self-assessment Exercises for Chapter 17......Page 284
18.1 Introduction: FP-Growth......Page 286
18.2.1 Pre-processing the Transaction Database......Page 289
18.2.2 Initialisation......Page 291
18.2.3 Processing Transaction 1: f, c, a, m, p......Page 292
18.2.4 Processing Transaction 2: f, c, a, b, m......Page 294
18.2.5 Processing Transaction 3: f, b......Page 298
18.2.6 Processing Transaction 4: c, b, p......Page 300
18.2.7 Processing Transaction 5: f, c, a, m, p......Page 302
18.3 Finding the Frequent Itemsets from the FP-tree......Page 303
18.3.1 Itemsets Ending with Item p......Page 306
18.3.2 Itemsets Ending with Item m......Page 316
18.4 Chapter Summary......Page 323
18.5 Self-assessment Exercises for Chapter 18......Page 324
19.1 Introduction......Page 326
19.2 k-Means Clustering......Page 329
19.2.1 Example......Page 330
19.2.2 Finding the Best Set of Clusters......Page 334
19.3 Agglomerative Hierarchical Clustering......Page 335
19.3.1 Recording the Distance Between Clusters......Page 338
19.3.2 Terminating the Clustering Process......Page 341
19.5 Self-assessment Exercises for Chapter 19......Page 342
20.1 Multiple Classi.cations......Page 344
20.2 Representing Text Documents for Data Mining......Page 345
20.3 Stop Words and Stemming......Page 347
20.5 Representing Text Documents: Constructing a Vector Space Model......Page 348
20.6 Normalising the Weights......Page 350
20.7 Measuring the Distance Between Two Vectors......Page 351
20.8 Measuring the Performance of a Text Classi.er......Page 352
20.9.1 Classifying Web Pages......Page 353
20.9.2 Hypertext Classi.cation versus Text Classi.cation......Page 354
20.11 Self-assessment Exercises for Chapter 20......Page 358
A Essential Mathematics......Page 360
B Datasets......Page 376
C Sources of Further Information......Page 398
D Glossary and Notation......Page 402
E Solutions to Self-assessment Exercises......Page 422
Index......Page 450