Data Clustering in C++: An Object-Oriented Approach

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

Data clustering is a highly interdisciplinary field, the goal of which is to divide a set of objects into homogeneous groups such that objects in the same group are similar and objects in different groups are quite distinct. Thousands of theoretical papers and a number of books on data clustering have been published over the past 50 years. However, few books exist to teach people how to implement data clustering algorithms. This book was written for anyone who wants to implement or improve their data clustering algorithms. Using object-oriented design and programming techniques, Data Clustering in C++ exploits the commonalities of all data clustering algorithms to create a flexible set of reusable classes that simplifies the implementation of any data clustering algorithm. Readers can follow the development of the base data clustering classes and several popular data clustering algorithms. Additional topics such as data pre-processing, data visualization, cluster visualization, and cluster interpretation are briefly covered. This book is divided into three parts-- Data Clustering and C++ Preliminaries: A review of basic concepts of data clustering, the unified modeling language, object-oriented programming in C++, and design patterns A C++ Data Clustering Framework: The development of data clustering base classes Data Clustering Algorithms: The implementation of several popular data clustering algorithms A key to learning a clustering algorithm is to implement and experiment the clustering algorithm. Complete listings of classes, examples, unit test cases, and GNU configuration files are included in the appendices of this book as well as in the CD-ROM of the book. The only requirements to compile the code are a modern C++ compiler and the Boost C++ libraries.

Author(s): Guojun Gan
Series: Chapman & Hall CRC Data Mining and Knowledge Discovery
Edition: Har/Cdr
Publisher: Chapman and Hall/CRC
Year: 2011

Language: English
Pages: 497

Data Clustering in C++ - An Object Oriented Approach......Page 1
Contents......Page 7
List of Figures......Page 15
List of Tables......Page 18
Preface......Page 20
I. Data Clustering and C++ Preliminaries......Page 24
1.1 Data Clustering......Page 25
1.1.1 Clustering versus Classification......Page 26
1.1.2 Definition of Clusters......Page 27
1.2 Data Types......Page 29
1.3 Dissimilarity and Similarity Measures......Page 30
1.3.1 Measures for Continuous Data......Page 31
1.3.3 Measures for Mixed-Type Data......Page 32
1.4 Hierarchical Clustering Algorithms......Page 33
1.4.1 Agglomerative Hierarchical Algorithms......Page 34
1.4.3 Other Hierarchical Algorithms......Page 36
1.5 Partitional Clustering Algorithms......Page 37
1.5.1 Center-Based Clustering Algorithms......Page 39
1.5.2 Search-BasedClustering Algorithms......Page 40
1.5.3 Graph-Based Clustering Algorithms......Page 41
1.5.5 Density-Based Clustering Algorithms......Page 42
1.5.6 Model-Based Clustering Algorithms......Page 43
1.5.8 Neural Network-Based Clustering Algorithms......Page 44
1.6 Cluster Validity......Page 45
1.7 Clustering Applications......Page 46
1.8.1 Books on Data Clustering......Page 47
1.8.2 Surveys on Data Clustering......Page 48
1.9 Summary......Page 50
2.1 Package Diagrams......Page 51
2.2 Class Diagrams......Page 54
2.3 Use Case Diagrams......Page 58
2.4 Activity Diagrams......Page 60
2.5 Notes......Page 61
2.6 Summary......Page 62
3.1 Object-Oriented Programming......Page 63
3.2 The C++ Programming Language......Page 64
3.3 Encapsulation......Page 67
3.4 Inheritance......Page 70
3.5 Polymorphism......Page 72
3.5.1 Dynamic Polymorphism......Page 73
3.5.2 Static Polymorphism......Page 74
3.6 Exception Handling......Page 76
3.7 Summary......Page 78
4. Design Patterns......Page 79
4.1 Singleton......Page 80
4.2 Composite......Page 83
4.3 Prototype......Page 86
4.4 Strategy......Page 89
4.5 Template Method......Page 91
4.6 Visitor......Page 94
4.7 Summary......Page 97
5.1.1 Containers......Page 98
5.1.2 Iterators......Page 103
5.1.3 Algorithms......Page 105
5.2 Boost C++ Libraries......Page 107
5.2.1 Smart Pointers......Page 108
5.2.2 Variant......Page 110
5.2.3 Variant versus Any......Page 111
5.2.4 Tokenizer......Page 113
5.2.5 Unit Test Framework......Page 114
5.3 GNU Build System......Page 116
5.3.1 Autoconf......Page 117
5.3.3 Libtool......Page 118
5.4 Cygwin......Page 119
5.5 Summary......Page 120
II. A C++ Data Clustering Framework......Page 121
6.1 Directory Structure and Filenames......Page 122
6.2.1 configure.ac......Page 124
6.2.2 Makefile.am......Page 125
6.3 Macros and typedef Declarations......Page 128
6.4 Error Handling......Page 130
6.5 Unit Testing......Page 131
6.6 Compilation and Installation......Page 132
6.7 Summary......Page 133
7.1.1 The Attribute Value Class......Page 134
7.1.2 The Base Attribute Information Class......Page 136
7.1.3 The Continuous Attribute Information Class......Page 138
7.1.4 The Discrete Attribute Information Class......Page 139
7.2.1 The Record Class......Page 141
7.2.2 The Schema Class......Page 143
7.3 Datasets......Page 144
7.4 A Dataset Example......Page 146
7.5 Summary......Page 149
8.1 Clusters......Page 150
8.2 Partitional Clustering......Page 152
8.3 Hierarchical Clustering......Page 154
8.4 Summary......Page 157
9.1 The Distance Base Class......Page 158
9.2 Minkowski Distance......Page 159
9.3 Euclidean Distance......Page 160
9.4 Simple Matching Distance......Page 161
9.5 Mixed Distance......Page 162
9.6 Mahalanobis Distance......Page 163
9.7 Summary......Page 166
10.1 Arguments......Page 167
10.2 Results......Page 168
10.3 Algorithms......Page 169
10.4 A Dummy Clustering Algorithm......Page 172
10.5 Summary......Page 176
11.1 The Container Class......Page 178
11.2 The Double-Key Map Class......Page 181
11.3.1 A CSV Dataset Reader......Page 184
11.3.2 A Dataset Generator......Page 187
11.3.3 A Dataset Normalizer......Page 190
11.4.1 The Join Value Visitor......Page 192
11.4.2 The Partition Creation Visitor......Page 193
11.5 The Dendrogram Class......Page 194
11.6 The Dendrogram Visitor......Page 196
11.7 Summary......Page 197
III. Data Clustering Algorithms......Page 200
12.1 Description of the Algorithm......Page 201
12.2 Implementation......Page 203
12.2.2 The Complete Linkage Algorithm......Page 208
12.2.3 The Group Average Algorithm......Page 209
12.2.5 The Centroid Algorithm......Page 210
12.2.6 TheMedian Algorithm......Page 211
12.2.7 Ward's Algorithm......Page 212
12.3 Examples......Page 213
12.3.1 The Single Linkage Algorithm......Page 214
12.3.2 The Complete Linkage Algorithm......Page 216
12.3.3 The Group Average Algorithm......Page 218
12.3.4 The Weighted Group Average Algorithm......Page 220
12.3.5 The Centroid Algorithm......Page 223
12.3.6 TheMedian Algorithm......Page 226
12.3.7 Ward's Algorithm......Page 228
12.4 Summary......Page 230
13.1 Description of the Algorithm......Page 232
13.2 Implementation......Page 233
13.3 Examples......Page 238
13.4 Summary......Page 242
14.1 Description of the Algorithm......Page 243
14.2 Implementation......Page 244
14.3 Examples......Page 249
14.4 Summary......Page 254
15.1 Description of the Algorithm......Page 255
15.2 Implementaion......Page 256
15.3 Examples......Page 260
15.4 Summary......Page 267
16.1 Description of the Algorithm......Page 268
16.2 Implementation......Page 269
16.3 Examples......Page 271
16.4 Summary......Page 276
17.1 Description of the Algorithm......Page 277
17.2 Implementation......Page 279
17.3 Examples......Page 286
17.4 Summary......Page 289
18.1 Description of the Algorithm......Page 291
18.2 Implementation......Page 293
18.3 Examples......Page 296
18.4 Summary......Page 302
19.1 Description of the Algorithm......Page 303
19.2 Implementation......Page 305
19.3 Examples......Page 312
19.4 Summary......Page 318
20.1 Message Passing Interface......Page 319
20.2 Description of the Algorithm......Page 322
20.3 Implementation......Page 323
20.4 Examples......Page 328
20.5 Summary......Page 332
A. Exercises and Projects......Page 334
B.1.1 Configuration File configure.ac......Page 336
B.1.2 m4 Macro File acinclude.m4......Page 337
B.1.3 Makefile......Page 338
B.2.2 Macros and typedef Declarations......Page 339
B.2.3 Class Error......Page 340
B.3.1 Makefile......Page 342
B.3.2 Class Algorithm......Page 343
B.3.4 Class Centroid......Page 345
B.3.5 Class Cmean......Page 346
B.3.7 Class Diana......Page 350
B.3.8 Class FSC......Page 354
B.3.9 Class GKmode......Page 358
B.3.10 Class GMC......Page 364
B.3.11 Class Kmean......Page 369
B.3.12 Class Kprototype......Page 372
B.3.13 Class LW......Page 373
B.3.14 Class Median......Page 375
B.3.15 Class Single......Page 376
B.3.16 Class Ward......Page 377
B.3.17 Class Weighted......Page 378
B.4.2 Class CenterCluster......Page 379
B.4.3 Class Cluster......Page 380
B.4.4 Class HClustering......Page 381
B.4.5 Class PClustering......Page 383
B.4.6 Class SubspaceCluster......Page 386
B.5.2 Class AttrValue......Page 387
B.5.3 Class AttrInfo......Page 388
B.5.4 Class CAttrInfo......Page 390
B.5.5 Class DAttrInfo......Page 392
B.5.6 Class Record......Page 395
B.5.7 Class Schema......Page 397
B.5.8 Class Dataset......Page 399
B.6.2 Class Distance......Page 403
B.6.3 Class EuclideanDistance......Page 404
B.6.4 Class MahalanobisDistance......Page 405
B.6.5 Class MinkowskiDistance......Page 406
B.6.6 Class MixedDistance......Page 407
B.6.7 Class SimpleMatchingDistance......Page 408
B.7.1 Makefile......Page 409
B.7.2 Class DendrogramVisitor......Page 410
B.7.3 Class InternalNode......Page 412
B.7.4 Class LeafNode......Page 414
B.7.5 Class Node......Page 415
B.7.7 Class JoinValueVisitor......Page 416
B.7.8 Class PCVisitor......Page 418
B.8.1 Makefile......Page 419
B.8.2 Class Container......Page 420
B.8.4 Class DatasetGenerator......Page 422
B.8.5 Class DatasetNormalizer......Page 424
B.8.6 Class DatasetReader......Page 426
B.8.7 Class Dendrogram......Page 429
B.8.8 Class nnMap......Page 432
B.8.9 Matrix Functions......Page 434
B.8.10 Null Types......Page 436
B.9.2 Agglomerative Hierarchical Algorithms......Page 437
B.9.3 A Divisive Hierarchical Algorithm......Page 440
B.9.4 The k-means Algorithm......Page 441
B.9.5 The c-means Algorithm......Page 444
B.9.6 The k-prototypes Algorithm......Page 446
B.9.7 The Genetic k-modes Algorithm......Page 448
B.9.8 The FSC Algorithm......Page 450
B.9.9 The Gaussian Mixture Clustering Algorithm......Page 452
B.9.10 A Parallel k-means Algorithm......Page 455
B.10.1 Makefile......Page 461
B.10.3 Test of AttrInfo......Page 462
B.10.4 Test of Dataset......Page 464
B.10.5 Test of Distance......Page 465
B.10.6 Test of nnMap......Page 467
B.10.7 Test of Matrices......Page 469
B.10.8 Test of Schema......Page 470
C.1.1 Rules......Page 472
C.1.2 Variables......Page 473
C.2.1 Boost for Windows......Page 474
C.2.2 Boost for Cygwin or Linux......Page 475
C.4 Installing GMP......Page 476
C.5 Installing MPICH2 and Boost MPI......Page 477
Bibliography......Page 480