This book focuses on dealing with large-scale data, a field commonly referred to as data mining. The book is divided into three sections. The first is an introduction to statistical aspects of data mining and machine learning and includes applications to text analysis, computer intrusion detection, and hiding of information in digital files. The second section focuses on a variety of statistical methodologies that have proven to be effective in data mining applications. These include clustering, classification, multivariate density estimation, tree-based methods, pattern recognition, outlier detection, genetic algorithms, and dimensionality reduction. The third section focuses on data visualization and covers visualization of high-dimensional data, novel graphical techniques with a focus on human factors, interactive graphics, and data visualization using virtual reality. This book represents a thorough cross section of internationally renowned thinkers who are inventing methods for dealing with a new data paradigm.
Key Features:
- Distinguished contributors who are international experts in aspects of data mining
- Includes data mining approaches to non-numerical data, including text data, Internet traffic data, and geographic data
- Highly topical discussions reflecting current thinking on contemporary technical issues, e.g. streaming data
- Discusses taxonomy of dataset sizes, computational complexity, and scalability, which most discussions ignore
- Thorough discussion of data visualization issues blending statistical, human factors, and computational insights
Author(s): C. R. Rao, E. J. Wegman, J. L. Solka
Series: Handbook of Statistics 24
Publisher: North Holland
Year: 2005
Language: English
Pages: 574
Introduction......Page 11
Order of magnitude considerations......Page 12
Feasibility limits due to CPU performance......Page 14
Feasibility limits due to file transfer performance......Page 17
Feasibility limits due to visual resolution......Page 18
Knowledge discovery in databases and data mining......Page 19
Association rules......Page 21
Data preparation......Page 24
Missing values and outliers......Page 25
Quantization......Page 27
SQL......Page 29
Data cubes and OLAP......Page 30
Density estimation......Page 31
Cluster analysis......Page 34
Hierarchical clustering......Page 35
The number of groups problem......Page 36
Functioning of an artificial neural network......Page 37
Visual data mining......Page 39
Graphics constructs for visual data mining......Page 40
Example 1 - PRIM 7 data......Page 42
Example 2 - iterative denoising with hyperspectral data......Page 44
Streaming data......Page 47
Counts, moments and densities......Page 48
Waterfall diagrams and transient geographic mapping......Page 50
Block-recursive plots and conditional plots......Page 52
References......Page 54
Introduction......Page 57
Discovering rules and patterns via AQ learning......Page 59
Types of problems in learning from examples......Page 62
Clustering of entities into conceptually meaningful categories......Page 63
Automated improvement of the search space: constructive induction......Page 65
Integrating qualitative and quantitative methods of numerical discovery......Page 66
Predicting processes qualitatively......Page 67
Knowledge improvement via incremental learning......Page 68
Summarizing the logical data analysis approach......Page 69
Strong patterns vs. complete and consistent rules......Page 70
Ruleset visualization via concept association graphs......Page 72
Integration of knowledge generation operators......Page 76
Summary......Page 79
Acknowledgements......Page 80
References......Page 81
Introduction......Page 86
Overview of networking......Page 87
The threat......Page 93
Probes and scans......Page 94
Denial of service attacks......Page 95
Gaining access......Page 100
Network monitoring......Page 101
TCP sessions......Page 106
Signatures versus anomalies......Page 110
User profiling......Page 111
Program profiling......Page 113
References......Page 116
Introduction and background......Page 118
Hidden Markov models......Page 119
Probabilistic context-free grammars......Page 121
Supervised disambiguation......Page 124
Unsupervised disambiguation......Page 125
Generic implementation......Page 128
Using term weights......Page 130
The bigram proximity matrix......Page 131
Matching coefficient......Page 132
Document classification via supervised learning......Page 133
Document classification via model-based clustering......Page 134
Towards knowledge discovery......Page 136
Summary......Page 138
References......Page 139
Approach......Page 141
Feature extraction......Page 148
Automated serendipity extraction on the Science News data set with no user driven focus of attention......Page 149
Automated serendipity extraction on the ONR ILIR data set with no user driven focus of attention......Page 153
Automated serendipity extraction on the Science News data set with user driven focus of attention......Page 157
Clustering results on the ONR ILIR dataset......Page 165
Clustering results on the Science News dataset......Page 173
Conclusions......Page 176
References......Page 177
Introduction......Page 178
Image formats......Page 179
Steganography......Page 181
Embedding by modifying carrier bits......Page 182
Embedding using pairs of values......Page 185
Steganalysis......Page 186
Relationship of steganography to watermarking......Page 188
Literature survey......Page 191
References......Page 193
Introduction......Page 195
Mahalanobis space......Page 196
Canonical coordinates......Page 197
Canonical coordinates for profiles......Page 198
Canonical coordinates for variables......Page 199
Loss of information due to dimensionality reduction......Page 200
An example......Page 201
Preprocessing of data......Page 203
V (variable) plot......Page 205
PVs biplot......Page 206
Two-way contingency tables (correspondence analysis)......Page 207
Discussion......Page 215
References......Page 216
Background......Page 218
Basics......Page 219
Practical classification rules......Page 221
Linear discriminant analysis......Page 222
Logistic discrimination......Page 223
The naive Bayes model......Page 224
The perceptron......Page 225
Tree classifiers......Page 226
Local nonparametric methods......Page 227
Neural networks......Page 228
Support vector machines......Page 229
Other approaches......Page 230
Other issues......Page 231
References......Page 232
Introduction......Page 234
Classical density estimators......Page 235
Properties of histograms......Page 236
Maximum likelihood and histograms......Page 237
L2 theory of histograms......Page 238
Practical histogram rules......Page 239
Frequency polygons......Page 242
Multivariate frequency curves......Page 243
Averaged shifted histograms......Page 244
Kernel estimators......Page 245
Multivariate kernel options......Page 247
Balloon estimators......Page 248
Sample point estimators......Page 249
Parameterization of sample-point estimators......Page 250
Estimating bandwidth matrices......Page 252
Mixture density estimation......Page 253
Fitting mixture models......Page 254
An example......Page 256
Visualization of densities......Page 257
Higher dimensions......Page 260
Curse of dimensionality......Page 262
References......Page 263
Introduction......Page 267
The need for robustness......Page 268
Description of the MCD......Page 269
The C-step......Page 270
Computational improvements......Page 271
The FAST-MCD algorithm......Page 272
Examples......Page 274
Multiple regression......Page 276
Multivariate regression......Page 282
Classification......Page 286
Classical PCA......Page 287
Robust PCA......Page 289
Example......Page 293
Selecting the number of components......Page 296
Example......Page 297
Partial Least Squares Regression......Page 300
Availability......Page 301
References......Page 304
Classification and regression trees......Page 307
Bagging and boosting......Page 309
Classification trees......Page 310
Overview of how CART creates a tree......Page 311
Determining the predicted class for a terminal node......Page 312
Selection of splits to create a partition......Page 313
Estimating the misclassification rate and selecting the right-sized tree......Page 315
Alternative approaches......Page 318
Using CART to create a regression tree......Page 319
Missing values......Page 321
Motivation for the method......Page 322
When and how bagging works......Page 324
Boosting......Page 327
AdaBoost......Page 328
Some related methods......Page 329
When and how boosting works......Page 330
References......Page 332
Introduction......Page 334
Class cover catch digraphs......Page 335
CCCD for classification......Page 337
Cluster catch digraph......Page 341
Fast algorithms......Page 343
Further enhancements......Page 346
Streaming data......Page 347
Examples using the fast algorithms......Page 349
Sloan Digital Sky Survey......Page 354
Text processing......Page 358
Acknowledgements......Page 360
References......Page 361
Introduction......Page 362
History......Page 363
Genetic algorithms......Page 364
Calculus-based schemes......Page 365
Genetic algorithms - an example......Page 366
Operational functionality of genetic algorithms......Page 368
The reproduction operator......Page 369
The crossover operator......Page 370
The mutation operator......Page 371
Encryption and other considerations......Page 372
Schemata......Page 373
Generalized penalty methods......Page 375
Multi-objective optimization......Page 378
Fuzzy logic controller......Page 379
Schema Theorem......Page 381
Windowing technique......Page 383
Elitism......Page 384
Advanced crossover techniques......Page 385
Partially mixed crossover......Page 386
Uniform order-based mutation......Page 387
Multi-parameters......Page 388
Closing remarks......Page 389
References......Page 390
Further reading......Page 391
Introduction......Page 394
Tools for constructing plane and frame interpolations: orthonormal frames and planar rotations......Page 400
Minimal subspace restriction......Page 401
Planar rotations......Page 402
Calculation and control of speed......Page 403
Outline of an algorithm for interpolation......Page 405
Interpolating paths of planes......Page 406
Interpolating paths of frames......Page 409
Orthogonal matrix paths and optimal paths for full-dimensional tours......Page 410
Givens paths......Page 411
Householder paths......Page 413
Conclusions......Page 414
References......Page 415
Introduction......Page 417
Background for quantitative graphics design......Page 419
General guidance......Page 420
Templates and GUIs......Page 421
The template for linked micromap (LM) plots......Page 422
Statistical panel variations......Page 424
Interactive extensions......Page 426
Dynamically conditioned choropleth maps......Page 429
Self-similar coordinates plots......Page 433
Closing remarks......Page 436
References......Page 437
Graphics, statistics and the computer......Page 439
Literature review......Page 440
Software review......Page 443
The interactive paradigm......Page 446
General definition......Page 448
Sample population......Page 451
Model operations......Page 454
Variable transformations......Page 455
Pair operator......Page 456
Weight operator......Page 457
Linear models......Page 458
Types of graphics......Page 459
Dotplot......Page 460
Trace plot......Page 461
Bar charts and pie charts......Page 462
Mosaic plot......Page 464
Boxplot......Page 465
Biplot (PCA)......Page 466
Text list or variable list......Page 467
Extensions for missing values......Page 468
Direct object manipulation......Page 469
Selection......Page 471
Selection tools......Page 472
Two-dimensional selection tools......Page 473
Selection operation......Page 474
Graphical selection......Page 475
Axes based selection......Page 476
Changing frame......Page 477
Changing graphical elements......Page 478
Changing attributes of graphical elements......Page 479
Zooming......Page 480
Changing color schemes......Page 481
Reformatting type......Page 482
Sorting data representing objects......Page 484
Reordering variables......Page 485
Adding model information......Page 486
Re-ordering scales......Page 487
Selecting individuals......Page 488
Indirect object manipulation......Page 489
Internal linking structures......Page 490
1-to-1 linking......Page 493
1-to-n linking......Page 495
Querying......Page 496
Interrogating axes......Page 497
External linking structure......Page 498
Linking frame size......Page 500
Linking models......Page 501
Linking observations......Page 502
Linking scales......Page 503
Identity linking......Page 505
Distance and neighborhood linking......Page 506
Overlaying......Page 507
Proportional highlighting......Page 508
Juxtaposition......Page 509
Linking interrogation......Page 510
Linked low-dimensional views......Page 511
Conditional probabilities......Page 513
Detecting outliers......Page 520
Clustering and classification......Page 521
Geometric structure......Page 522
Relationships......Page 524
Models with continuous response......Page 526
Models with discrete response......Page 527
Independence models......Page 532
Conclusion......Page 534
Future work......Page 535
References......Page 536
Computer graphics......Page 540
Transformation......Page 541
Color and lighting......Page 542
Graphics libraries......Page 543
Visualization......Page 544
Modeling and rendering......Page 545
Animation and simulation......Page 546
File format converters......Page 547
Data type......Page 548
Abstract data - information visualization......Page 549
Computational steering......Page 550
Parallel coordinates......Page 551
Sorting the study units......Page 552
Linking the related elements of a study unit......Page 553
Statistical data retrieval......Page 554
Genetic algorithm data visualization......Page 555
Hardware and software......Page 557
Basic VR system properties......Page 558
A list of VR tools......Page 559
Basic functions in VR tool......Page 560
Some examples of visualization using VR......Page 561
References......Page 562