Data clustering is a highly interdisciplinary field, the goal of which is to divide a set of objects into homogeneous groups such that objects in the same group are similar and objects in different groups are quite distinct. Thousands of theoretical papers and a number of books on data clustering have been published over the past 50 years. However, few books exist to teach people how to implement data clustering algorithms. This book was written for anyone who wants to implement or improve their data clustering algorithms.
Using object-oriented design and programming techniques, Data Clustering in C++ exploits the commonalities of all data clustering algorithms to create a flexible set of reusable classes that simplifies the implementation of any data clustering algorithm. Readers can follow the development of the base data clustering classes and several popular data clustering algorithms. Additional topics such as data pre-processing, data visualization, cluster visualization, and cluster interpretation are briefly covered.
This book is divided into three parts--
Data Clustering and C++ Preliminaries: A review of basic concepts of data clustering, the unified modeling language, object-oriented programming in C++, and design patterns
A C++ Data Clustering Framework: The development of data clustering base classes
Data Clustering Algorithms: The implementation of several popular data clustering algorithms
A key to learning a clustering algorithm is to implement and experiment the clustering algorithm. Complete listings of classes, examples, unit test cases, and GNU configuration files are included in the appendices of this book as well as in the downloadable resources. The only requirements to compile the code are a modern C++ compiler and the Boost C++ libraries.
Author(s): Guojun Gan
Publisher: CRC Press
Year: 2011
Language: English
Pages: 520
Data Clustering in C++ - An Object Oriented Approach
Contents
List of Figures
List of Tables
Preface
I. Data Clustering and C++ Preliminaries
1. Introduction to Data Clustering
1.1 Data Clustering
1.1.1 Clustering versus Classification
1.1.2 Definition of Clusters
1.2 Data Types
1.3 Dissimilarity and Similarity Measures
1.3.1 Measures for Continuous Data
1.3.2 Measures for Discrete Data
1.3.3 Measures for Mixed-Type Data
1.4 Hierarchical Clustering Algorithms
1.4.1 Agglomerative Hierarchical Algorithms
1.4.2 Divisive Hierarchical Algorithms
1.4.3 Other Hierarchical Algorithms
1.4.4 Dendrograms
1.5 Partitional Clustering Algorithms
1.5.1 Center-Based Clustering Algorithms
1.5.2 Search-BasedClustering Algorithms
1.5.3 Graph-Based Clustering Algorithms
1.5.4 Grid-Based Clustering Algorithms
1.5.5 Density-Based Clustering Algorithms
1.5.6 Model-Based Clustering Algorithms
1.5.7 Subspace Clustering Algorithms
1.5.8 Neural Network-Based Clustering Algorithms
1.5.9 Fuzzy Clustering Algorithms
1.6 Cluster Validity
1.7 Clustering Applications
1.8 Literature of Clustering Algorithms
1.8.1 Books on Data Clustering
1.8.2 Surveys on Data Clustering
1.9 Summary
2. The Unified Modeling Language
2.1 Package Diagrams
2.2 Class Diagrams
2.3 Use Case Diagrams
2.4 Activity Diagrams
2.5 Notes
2.6 Summary
3. Object-Oriented Programming and C++
3.1 Object-Oriented Programming
3.2 The C++ Programming Language
3.3 Encapsulation
3.4 Inheritance
3.5 Polymorphism
3.5.1 Dynamic Polymorphism
3.5.2 Static Polymorphism
3.6 Exception Handling
3.7 Summary
4. Design Patterns
4.1 Singleton
4.2 Composite
4.3 Prototype
4.4 Strategy
4.5 Template Method
4.6 Visitor
4.7 Summary
5. C++ Libraries and Tools
5.1 The Standard Template Library
5.1.1 Containers
5.1.2 Iterators
5.1.3 Algorithms
5.2 Boost C++ Libraries
5.2.1 Smart Pointers
5.2.2 Variant
5.2.3 Variant versus Any
5.2.4 Tokenizer
5.2.5 Unit Test Framework
5.3 GNU Build System
5.3.1 Autoconf
5.3.2 Automake
5.3.3 Libtool
5.3.4 Using GNU Autotools
5.4 Cygwin
5.5 Summary
II. A C++ Data Clustering Framework
6. The Clustering Library
6.1 Directory Structure and Filenames
6.2 Specification Files
6.2.1 configure.ac
6.2.2 Makefile.am
6.3 Macros and typedef Declarations
6.4 Error Handling
6.5 Unit Testing
6.6 Compilation and Installation
6.7 Summary
7. Datasets
7.1 Attributes
7.1.1 The Attribute Value Class
7.1.2 The Base Attribute Information Class
7.1.3 The Continuous Attribute Information Class
7.1.4 The Discrete Attribute Information Class
7.2 Records
7.2.1 The Record Class
7.2.2 The Schema Class
7.3 Datasets
7.4 A Dataset Example
7.5 Summary
8. Clusters
8.1 Clusters
8.2 Partitional Clustering
8.3 Hierarchical Clustering
8.4 Summary
9. Dissimilarity Measures
9.1 The Distance Base Class
9.2 Minkowski Distance
9.3 Euclidean Distance
9.4 Simple Matching Distance
9.5 Mixed Distance
9.6 Mahalanobis Distance
9.7 Summary
10. Clustering Algorithms
10.1 Arguments
10.2 Results
10.3 Algorithms
10.4 A Dummy Clustering Algorithm
10.5 Summary
11. Utility Classes
11.1 The Container Class
11.2 The Double-Key Map Class
11.3 The Dataset Adapters
11.3.1 A CSV Dataset Reader
11.3.2 A Dataset Generator
11.3.3 A Dataset Normalizer
11.4 The Node Visitors
11.4.1 The Join Value Visitor
11.4.2 The Partition Creation Visitor
11.5 The Dendrogram Class
11.6 The Dendrogram Visitor
11.7 Summary
III. Data Clustering Algorithms
12. Agglomerative Hierarchical Algorithms
12.1 Description of the Algorithm
12.2 Implementation
12.2.1 The Single Linkage Algorithm
12.2.2 The Complete Linkage Algorithm
12.2.3 The Group Average Algorithm
12.2.4 The Weighted Group Average Algorithm
12.2.5 The Centroid Algorithm
12.2.6 TheMedian Algorithm
12.2.7 Ward's Algorithm
12.3 Examples
12.3.1 The Single Linkage Algorithm
12.3.2 The Complete Linkage Algorithm
12.3.3 The Group Average Algorithm
12.3.4 The Weighted Group Average Algorithm
12.3.5 The Centroid Algorithm
12.3.6 TheMedian Algorithm
12.3.7 Ward's Algorithm
12.4 Summary
13. DIANA
13.1 Description of the Algorithm
13.2 Implementation
13.3 Examples
13.4 Summary
14. The k-means Algorithm
14.1 Description of the Algorithm
14.2 Implementation
14.3 Examples
14.4 Summary
15. The c-means Algorithm
15.1 Description of the Algorithm
15.2 Implementaion
15.3 Examples
15.4 Summary
16. The k-prototypes Algorithm
16.1 Description of the Algorithm
16.2 Implementation
16.3 Examples
16.4 Summary
17. The Genetic k-modes Algorithm
17.1 Description of the Algorithm
17.2 Implementation
17.3 Examples
17.4 Summary
18. The FSC Algorithm
18.1 Description of the Algorithm
18.2 Implementation
18.3 Examples
18.4 Summary
19. The Gaussian Mixture Algorithm
19.1 Description of the Algorithm
19.2 Implementation
19.3 Examples
19.4 Summary
20. A Parallel k-means Algorithm
20.1 Message Passing Interface
20.2 Description of the Algorithm
20.3 Implementation
20.4 Examples
20.5 Summary
A. Exercises and Projects
B. Listings
B.1 Files in Folder ClusLib
B.1.1 Configuration File configure.ac
B.1.2 m4 Macro File acinclude.m4
B.1.3 Makefile
B.2 Files in Folder cl
B.2.1 Makefile
B.2.2 Macros and typedef Declarations
B.2.3 Class Error
B.3 Files in Folder cl/algorithms
B.3.1 Makefile
B.3.2 Class Algorithm
B.3.3 Class Average
B.3.4 Class Centroid
B.3.5 Class Cmean
B.3.6 Class Complete
B.3.7 Class Diana
B.3.8 Class FSC
B.3.9 Class GKmode
B.3.10 Class GMC
B.3.11 Class Kmean
B.3.12 Class Kprototype
B.3.13 Class LW
B.3.14 Class Median
B.3.15 Class Single
B.3.16 Class Ward
B.3.17 Class Weighted
B.4 Files in Folder cl/clusters
B.4.1 Makefile
B.4.2 Class CenterCluster
B.4.3 Class Cluster
B.4.4 Class HClustering
B.4.5 Class PClustering
B.4.6 Class SubspaceCluster
B.5 Files in Folder cl/datasets
B.5.1 Makefile
B.5.2 Class AttrValue
B.5.3 Class AttrInfo
B.5.4 Class CAttrInfo
B.5.5 Class DAttrInfo
B.5.6 Class Record
B.5.7 Class Schema
B.5.8 Class Dataset
B.6 Files in Folder cl/distances
B.6.1 Makefile
B.6.2 Class Distance
B.6.3 Class EuclideanDistance
B.6.4 Class MahalanobisDistance
B.6.5 Class MinkowskiDistance
B.6.6 Class MixedDistance
B.6.7 Class SimpleMatchingDistance
B.7 Files in Folder cl/patterns
B.7.1 Makefile
B.7.2 Class DendrogramVisitor
B.7.3 Class InternalNode
B.7.4 Class LeafNode
B.7.5 Class Node
B.7.6 Class NodeVisitor
B.7.7 Class JoinValueVisitor
B.7.8 Class PCVisitor
B.8 Files in Folder cl/utilities
B.8.1 Makefile
B.8.2 Class Container
B.8.3 Class DataAdapter
B.8.4 Class DatasetGenerator
B.8.5 Class DatasetNormalizer
B.8.6 Class DatasetReader
B.8.7 Class Dendrogram
B.8.8 Class nnMap
B.8.9 Matrix Functions
B.8.10 Null Types
B.9 Files in Folder examples
B.9.1 Makefile
B.9.2 Agglomerative Hierarchical Algorithms
B.9.3 A Divisive Hierarchical Algorithm
B.9.4 The k-means Algorithm
B.9.5 The c-means Algorithm
B.9.6 The k-prototypes Algorithm
B.9.7 The Genetic k-modes Algorithm
B.9.8 The FSC Algorithm
B.9.9 The Gaussian Mixture Clustering Algorithm
B.9.10 A Parallel k-means Algorithm
B.10 Files in Folder test-suite
B.10.1 Makefile
B.10.2 The Master Test Suite
B.10.3 Test of AttrInfo
B.10.4 Test of Dataset
B.10.5 Test of Distance
B.10.6 Test of nnMap
B.10.7 Test of Matrices
B.10.8 Test of Schema
C. Software
C.1 An Introduction to Makefiles
C.1.1 Rules
C.1.2 Variables
C.2 Installing Boost
C.2.1 Boost for Windows
C.2.2 Boost for Cygwin or Linux
C.3 Installing Cygwin
C.4 Installing GMP
C.5 Installing MPICH2 and Boost MPI
Bibliography