I have a great deal of experience preparing data for analysis. I was looking for a book that would add to my understanding of data preparation and improve how I organize it. This is not that book. At best, the book provides insight into the types of issues faced in preparing data and emphasizes the value of doing that work. Rather than criticize, I wish to forewarn those who have already practiced at a somewhat rigorous level (more than five semesters of statistics/data mining) that this might not be what you are seeking.
Author(s): Dorian Pyle
Series: The Morgan Kaufmann Series in Data Management Systems
Edition: Book & CD-ROM 1st
Publisher: Morgan Kaufmann
Year: 1999
Language: English
Pages: 466
Data Preparation for Data Mining......Page 1
Dedication......Page 2
Table of Contents......Page 3
Who This Book Is For......Page 4
Special Features......Page 5
Acknowledgments......Page 6
Knowledge, Power, Data, and the World......Page 7
Mining Data for Information......Page 8
Preparing the Data, Preparing the Miner......Page 9
Is This Book for You?......Page 10
Organization......Page 11
Back to the Future......Page 12
Overview......Page 14
1.1 The Data Exploration Process......Page 15
1.1.1 Stage 1: Exploring the Problem Space......Page 16
1.1.2 Stage 2: Exploring the Solution Space......Page 22
1.1.3 Stage 3: Specifying the Implementation Method......Page 24
1.1.4 Stage 4: Mining the Data......Page 25
1.1.5 Exploration: Mining and Modeling......Page 30
1.2.1 Ten Golden Rules......Page 31
1.2.2 Introducing Modeling Tools......Page 32
1.2.3 Types of Models......Page 34
1.2.5 Explanatory and Predictive Models......Page 36
1.2.6 Static and Continuously Learning Models......Page 37
Supplemental Material: A Continuously Learning Model Application......Page 39
How the Continuously Learning Model Worked......Page 40
2.1 Measuring the World......Page 45
2.1.1 Objects......Page 46
2.1.3 Errors of Measurement......Page 47
2.2 Types of Measurements......Page 52
2.2.1 Scalar Measurements......Page 53
2.2.2 Nonscalar Measurements......Page 58
2.3 Continua of Attributes of Variables......Page 59
2.3.2 The Discrete-Continuous Continuum......Page 60
2.4 Scale Measurement Example......Page 64
2.5 Transformations and Difficulties—Variables, Data, and Information......Page 65
2.6.1 Data Representation......Page 66
2.6.2 Building Data—Dealing with Variables......Page 67
2.6.3 Building Mineable Data Sets......Page 76
2.7 Summary......Page 84
Supplemental Material: Combinations......Page 85
Overview......Page 87
3.1 Data Preparation: Inputs, Outputs, Models, and Decisions......Page 88
3.1.1 Step 1: Prepare the Data......Page 89
3.1.3 Step 3: Model the Data......Page 94
3.1.4 Use the Model......Page 95
3.2 Modeling Tools and Data Preparation......Page 96
3.2.1 How Modeling Tools Drive Data Preparation......Page 97
3.2.3 Decision Lists......Page 99
3.2.4 Neural Networks......Page 100
3.2.6 Modeling Data with the Tools......Page 101
3.2.7 Predictions and Rules......Page 102
3.3 Stages of Data Preparation......Page 104
3.3.1 Stage 1: Accessing the Data......Page 105
3.3.3 Stage 3: Enhancing and Enriching the Data......Page 106
3.3.5 Stage 5: Determining Data Structure (Super-, Macro-, and Micro-)......Page 107
3.3.6 Stage 6: Building the PIE......Page 108
3.3.8 Stage 8: Modeling the Data......Page 113
3.4 And the Result Is . . . ?......Page 114
Overview......Page 116
4.1 Data Discovery......Page 117
4.1.1 Data Access Issues......Page 118
4.2.1 Detail/Aggregation Level (Granularity)......Page 120
4.2.2 Consistency......Page 121
4.2.4 Objects......Page 122
4.2.7 Defaults......Page 123
4.2.10 Duplicate or Redundant Variables......Page 124
4.3.1 Reverse Pivoting......Page 125
4.3.2 Feature Extraction......Page 126
4.3.4 Explanatory Structure......Page 127
4.3.5 Data Enhancement or Enrichment......Page 128
4.3.6 Sampling Bias......Page 129
4.4.1 Looking at the Variables......Page 130
4.4.2 Relationships between Variables......Page 136
4.5.1 Looking at the Variables......Page 138
4.5.2 Relationships between Variables......Page 141
4.6 The Data Assay......Page 142
5.1.1 How Much Data?......Page 144
5.1.2 Variability......Page 145
5.1.4 Measuring Variability......Page 149
5.1.5 Variability and Deviation......Page 150
5.3 Variability of Numeric Variables......Page 154
5.3.1 Variability and Sampling......Page 155
5.3.2 Variability and Convergence......Page 156
5.4 Variability and Confidence in Alpha Variables......Page 157
5.4.1 Ordering and Rate of Discovery......Page 158
5.5 Measuring Confidence......Page 159
5.5.2 Testing for Confidence......Page 160
5.5.3 Confidence Tests and Variability......Page 163
5.6 Confidence in Capturing Variability......Page 165
5.6.1 A Brief Introduction to the Normal Distribution......Page 166
5.6.2 Normally Distributed Probabilities......Page 167
5.6.3 Capturing Normally Distributed Probabilities: An Example......Page 169
5.6.4 Capturing Confidence, Capturing Variance......Page 170
5.7.1 Missing Values......Page 171
5.7.3 Problems with Sampling......Page 172
5.7.4 Monotonic Variable Detection......Page 173
5.8 Confidence and Instance Count......Page 174
Supplemental Material: Confidence Samples......Page 175
Overview......Page 177
6.1 Representing Alphas and Remapping......Page 178
6.1.1 One-of-......Page 179
6.1.2......Page 180
6.1.3 Remapping to Eliminate Ordering......Page 181
6.1.4 Remapping One-to-Many Patterns, or Ill-Formed Problems......Page 182
6.1.5 Remapping Circular Discontinuity......Page 185
6.2.1 Unit State Space......Page 187
6.2.3 Position in State Space......Page 188
6.2.4 Neighbors and Associates......Page 189
6.2.5 Density and Sparsity......Page 191
6.2.6 Nearby and Distant Nearest Neighbors......Page 194
6.2.8 Contours, Peaks, and Valleys......Page 195
6.2.10 Objects in State Space......Page 196
6.2.12 Mapping Alpha Values......Page 197
6.2.13 Location, Location, Location!......Page 198
6.2.14 Numerics, Alphas, and the Montreal Canadiens......Page 199
6.3.1 Two-Way Tables......Page 208
6.3.2 More Values, More Variables, and Meaning of the Numeration......Page 216
6.4 Dimensionality......Page 217
6.4.2 Squashing a Triangle......Page 218
6.4.4 Scree Plots......Page 221
6.6 Summary......Page 222
7.1 Normalizing a Variable’s Range......Page 224
7.1.2 The Nature and Scope of the Out-of-Range Values......Page 226
Problem......Page 227
7.1.3 Discovering the Range of Values When Building the PIE......Page 228
7.1.4 Out-of-Range Values When Training......Page 232
7.1.6 Out-of-Range Values When Executing......Page 234
7.1.7 Scaling Transformations......Page 235
7.1.8 Softmax Scaling......Page 241
7.1.9 Normalizing Ranges......Page 242
7.2.1 The Nature of Distributions......Page 243
7.2.2 Distributive Difficulties......Page 244
7.2.3 Adjusting Distributions......Page 245
7.2.4 Modified Distributions......Page 248
7.3 Summary......Page 251
Supplemental Material: The Logistic Function......Page 252
Modifying the Linear Part of the Logistic Function Range......Page 256
8.1 Retaining Information about Missing Values......Page 257
8.1.2 Capturing Patterns......Page 258
8.2.1 Unbiased Estimators......Page 260
8.2.2 Variability Relationships......Page 261
8.2.3 Relationships between Variables......Page 263
8.2.4 Preserving Between-Variable Relationships......Page 265
Supplemental Material: Using Regression to Find Least Information-Damaging Missing Values......Page 267
Alternative Methods of Missing-Value Replacement......Page 276
Overview......Page 279
9.2 Types of Series......Page 280
9.3.1 Constructing a Series......Page 281
9.3.3 Describing a Series—Fourier......Page 282
9.3.4 Describing a Series—Spectrum......Page 286
9.3.5 Describing a Series—Trend, Seasonality, Cycles, Noise......Page 291
9.3.6 Describing a Series—Autocorrelation......Page 293
9.4 Modeling Series Data......Page 295
9.5.1 Missing Values......Page 296
9.5.3 Nonuniform Displacement......Page 297
9.5.4 Trend......Page 298
9.6.1 Filtering......Page 300
9.6.2 Moving Averages......Page 302
9.6.3 Smoothing 1—PVM Smoothing......Page 308
9.6.4 Smoothing 2—Median Smoothing, Resmoothing, and Hanning......Page 309
9.6.5 Extraction......Page 310
9.6.6 Differencing......Page 311
9.7 Other Problems......Page 314
9.7.2 Distribution......Page 315
9.7.1 Numerating Alpha Values......Page 317
9.7.2 Distribution......Page 318
9.8 Preparing Series Data......Page 320
9.8.2 Signposts on the Rocky Road......Page 322
9.9 Implementation Notes......Page 324
10.1 Using Sparsely Populated Variables......Page 326
10.1.2 Binning Sparse Numerical Values......Page 327
10.1.3 Present-Value Patterns (PVPs)......Page 328
10.2 Problems with High-Dimensionality Data Sets......Page 329
10.2.1 Information Representation......Page 331
10.2.2 Representing High-Dimensionality Data in Fewer Dimensions......Page 332
10.3 Introducing the Neural Network......Page 334
10.3.1 Training a Neural Network......Page 335
10.3.3 Reshaping the Logistic Curve......Page 336
10.3.4 Single-Input Neurons......Page 337
10.3.5 Multiple-Input Neurons......Page 339
10.3.6 Networking Neurons to Estimate a Function......Page 340
10.3.7 Network Learning......Page 342
10.3.9 Network Prediction—Output Layer......Page 343
10.3.10 Stochastic Network Performance......Page 344
10.3.11 Network Architecture 1—The Autoassociative Network......Page 345
10.3.12 Network Architecture 2—The Sparsely Connected Network......Page 346
10.4 Compressing Variables......Page 347
10.4.1 Using Compressed Dimensionality Data......Page 348
10.5 Removing Variables......Page 349
10.5.1 Estimating Variable Importance 1: What Doesn’t Work......Page 350
10.5.2 Estimating Variable Importance 2: Clues......Page 351
10.5.3 Estimating Variable Importance 3: Configuring and Training the Network......Page 352
10.6 How Much Data Is Enough?......Page 354
10.6.1 Joint Distribution......Page 355
10.6.2 Capturing Joint Variability......Page 359
10.6.3 Degrees of Freedom......Page 360
10.7 Beyond Joint Distribution......Page 361
10.7.1 Enhancing the Data Set......Page 362
10.7.2 Data Sets in Perspective......Page 365
10.8.1 Collapsing Extremely Sparsely Populated Variables......Page 366
10.8.4 Feature Enhancement......Page 367
10.9 Where Next?......Page 368
Overview......Page 369
11.1 Introduction to the Data Survey......Page 370
11.2 Information and Communication......Page 371
11.2.2 Measuring Information: Signals......Page 373
11.2.3 Measuring Information: Bits of Information......Page 375
11.2.4 Measuring Information: Surprise......Page 378
11.2.6 Measuring Information: Dictionaries......Page 380
11.3 Mapping Using Entropy......Page 382
11.3.1 Whole Data Set Entropy......Page 385
11.3.2 Conditional Entropy between Inputs and Outputs......Page 386
11.3.4 Other Survey Uses for Entropy and Information......Page 388
11.3.5 Looking for Information......Page 389
11.4 Identifying Problems with a Data Survey......Page 390
11.4.1 Confidence and Sufficient Data......Page 392
11.4.2 Detecting Sparsity......Page 393
11.4.3 Manifold Definition......Page 394
11.5 Clusters......Page 401
11.6 Sampling Bias......Page 402
11.7 Making the Data Survey......Page 405
11.8 Novelty Detection......Page 407
11.9 Other Directions......Page 408
Supplemental Material: Entropic Analysis—Example......Page 410
Surveying Data Sets......Page 416
Overview......Page 442
12.1.1 Assumptions......Page 443
12.1.2 Models......Page 444
12.1.3 Data Mining vs. Exploratory Data Analysis......Page 445
12.2 Characterizing Data......Page 447
12.2.1 Decision Trees......Page 448
12.2.2 Clusters......Page 449
12.2.4 Neural Networks and Regression......Page 450
12.3 Prepared Data and Modeling Algorithms......Page 451
12.3.1 Neural Networks and the CREDIT Data Set......Page 452
12.3.2 Decision Trees and the CREDIT Data Set......Page 455
12.4 Practical Use of Data Preparation and Prepared Data......Page 456
12.5 Looking at Present Modeling Tools and Future Directions......Page 457
12.5.1 Near Future......Page 458
12.5.2 Farther Out......Page 459
Control Variables......Page 461
Sample Control File......Page 462
Appendix B: Further Reading......Page 464