Data Mining: The Textbook


This textbook explores data mining from the fundamentals to complex data types and their applications, capturing the wide diversity of problem domains in which data mining is used. It goes beyond the traditional focus on data mining problems to introduce advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. Until now, no single book has addressed all these topics in a comprehensive and integrated way.

The chapters of this book fall into one of three categories:

- Fundamental chapters: Data mining has four main problems, corresponding to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems.
- Domain chapters: These chapters discuss the methods used for specific data domains, such as text data, time-series data, sequence data, graph data, and spatial data.
- Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor.

Appropriate for both introductory and advanced data mining courses, Data Mining: The Textbook balances mathematical details and intuition. It contains the mathematical details needed by professors and researchers, but presents them in a simple and intuitive style to improve accessibility for students and industrial practitioners, including those with a limited mathematical background. Numerous illustrations, examples, and exercises are included, with an emphasis on semantically interpretable examples.

Praise for Data Mining: The Textbook

“As I read through this book, I have already decided to use it in my classes. This is a book written by an outstanding researcher who has made fundamental contributions to data mining, in a way that is both accessible and up to date. The book is complete with theory and practical use cases. It's a must-have for students and professors alike!” -- Qiang Yang, Chair of Computer Science and Engineering at Hong Kong University of Science and Technology

“This is the most amazing and comprehensive textbook on data mining. It covers not only the fundamental problems, such as clustering, classification, outliers and frequent patterns, and different data types, including text, time series, sequences, spatial data and graphs, but also various applications, such as recommenders, the Web, social networks and privacy. It is a great book for graduate students and researchers as well as practitioners.” -- Philip S. Yu, UIC Distinguished Professor and Wexler Chair in Information Technology at the University of Illinois at Chicago

Author(s): Charu C. Aggarwal
Publisher: Springer
Year: 2015

Language: English
Pages: 734

Contents
Preface
Acknowledgments
Author Biography
1 An Introduction to Data Mining
1.1 Introduction
1.2 The Data Mining Process
1.2.1 The Data Preprocessing Phase
1.2.2 The Analytical Phase
1.3 The Basic Data Types
1.3.1 Nondependency-Oriented Data
1.3.1.1 Quantitative Multidimensional Data
1.3.1.2 Categorical and Mixed Attribute Data
1.3.1.3 Binary and Set Data
1.3.1.4 Text Data
1.3.2 Dependency-Oriented Data
1.3.2.1 Time-Series Data
1.3.2.2 Discrete Sequences and Strings
1.3.2.3 Spatial Data
1.3.2.4 Network and Graph Data
1.4 The Major Building Blocks: A Bird's Eye View
1.4.1 Association Pattern Mining
1.4.2 Data Clustering
1.4.3 Outlier Detection
1.4.4 Data Classification
1.4.5 Impact of Complex Data Types on Problem Definitions
1.4.5.1 Pattern Mining with Complex Data Types
1.4.5.2 Clustering with Complex Data Types
1.4.5.3 Outlier Detection with Complex Data Types
1.4.5.4 Classification with Complex Data Types
1.5 Scalability Issues and the Streaming Scenario
1.6 A Stroll Through Some Application Scenarios
1.6.1 Store Product Placement
1.6.2 Customer Recommendations
1.6.3 Medical Diagnosis
1.6.4 Web Log Anomalies
1.7 Summary
1.8 Bibliographic Notes
1.9 Exercises
2 Data Preparation
2.1 Introduction
2.2 Feature Extraction and Portability
2.2.1 Feature Extraction
2.2.2 Data Type Portability
2.2.2.1 Numeric to Categorical Data: Discretization
2.2.2.2 Categorical to Numeric Data: Binarization
2.2.2.3 Text to Numeric Data
2.2.2.4 Time Series to Discrete Sequence Data
2.2.2.5 Time Series to Numeric Data
2.2.2.6 Discrete Sequence to Numeric Data
2.2.2.7 Spatial to Numeric Data
2.2.2.8 Graphs to Numeric Data
2.2.2.9 Any Type to Graphs for Similarity-Based Applications
2.3 Data Cleaning
2.3.1 Handling Missing Entries
2.3.2 Handling Incorrect and Inconsistent Entries
2.3.3 Scaling and Normalization
2.4 Data Reduction and Transformation
2.4.1 Sampling
2.4.1.1 Sampling for Static Data
2.4.1.2 Reservoir Sampling for Data Streams
2.4.2 Feature Subset Selection
2.4.3 Dimensionality Reduction with Axis Rotation
2.4.3.1 Principal Component Analysis
2.4.3.2 Singular Value Decomposition
2.4.3.3 Latent Semantic Analysis
2.4.3.4 Applications of PCA and SVD
2.4.4 Dimensionality Reduction with Type Transformation
2.4.4.1 Haar Wavelet Transform
2.4.4.2 Multidimensional Scaling
2.4.4.3 Spectral Transformation and Embedding of Graphs
2.5 Summary
2.6 Bibliographic Notes
2.7 Exercises
3 Similarity and Distances
3.1 Introduction
3.2 Multidimensional Data
3.2.1 Quantitative Data
3.2.1.1 Impact of Domain-Specific Relevance
3.2.1.2 Impact of High Dimensionality
3.2.1.3 Impact of Locally Irrelevant Features
3.2.1.4 Impact of Different Lp-Norms
3.2.1.5 Match-Based Similarity Computation
3.2.1.6 Impact of Data Distribution
3.2.1.7 Nonlinear Distributions: ISOMAP
3.2.1.8 Impact of Local Data Distribution
3.2.1.9 Computational Considerations
3.2.2 Categorical Data
3.2.3 Mixed Quantitative and Categorical Data
3.3 Text Similarity Measures
3.3.1 Binary and Set Data
3.4 Temporal Similarity Measures
3.4.1 Time-Series Similarity Measures
3.4.1.1 Impact of Behavioral Attribute Normalization
3.4.1.2 Lp-Norm
3.4.1.3 Dynamic Time Warping Distance
3.4.1.4 Window-Based Methods
3.4.2 Discrete Sequence Similarity Measures
3.4.2.1 Edit Distance
3.4.2.2 Longest Common Subsequence
3.5 Graph Similarity Measures
3.5.1 Similarity between Two Nodes in a Single Graph
3.5.1.1 Structural Distance-Based Measure
3.5.1.2 Random Walk-Based Similarity
3.5.2 Similarity Between Two Graphs
3.6 Supervised Similarity Functions
3.7 Summary
3.8 Bibliographic Notes
3.9 Exercises
4 Association Pattern Mining
4.1 Introduction
4.2 The Frequent Pattern Mining Model
4.3 Association Rule Generation Framework
4.4 Frequent Itemset Mining Algorithms
4.4.1 Brute Force Algorithms
4.4.2 The Apriori Algorithm
4.4.2.1 Efficient Support Counting
4.4.3 Enumeration-Tree Algorithms
4.4.3.1 Enumeration-Tree-Based Interpretation of Apriori
4.4.3.2 TreeProjection and DepthProject
4.4.3.3 Vertical Counting Methods
4.4.4 Recursive Suffix-Based Pattern Growth Methods
4.4.4.1 Implementation with Arrays but No Pointers
4.4.4.2 Implementation with Pointers but No FP-Tree
4.4.4.3 Implementation with Pointers and FP-Tree
4.4.4.4 Trade-offs with Different Data Structures
4.4.4.5 Relationship Between FP-Growth and Enumeration-Tree Methods
4.5 Alternative Models: Interesting Patterns
4.5.1 Statistical Coefficient of Correlation
4.5.2 χ² Measure
4.5.3 Interest Ratio
4.5.4 Symmetric Confidence Measures
4.5.5 Cosine Coefficient on Columns
4.5.6 Jaccard Coefficient and the Min-hash Trick
4.5.7 Collective Strength
4.5.8 Relationship to Negative Pattern Mining
4.6 Useful Meta-algorithms
4.6.1 Sampling Methods
4.6.2 Data Partitioned Ensembles
4.6.3 Generalization to Other Data Types
4.6.3.1 Quantitative Data
4.6.3.2 Categorical Data
4.7 Summary
4.8 Bibliographic Notes
4.9 Exercises
5 Association Pattern Mining: Advanced Concepts
5.1 Introduction
5.2 Pattern Summarization
5.2.1 Maximal Patterns
5.2.2 Closed Patterns
5.2.3 Approximate Frequent Patterns
5.2.3.1 Approximation in Terms of Transactions
5.2.3.2 Approximation in Terms of Itemsets
5.3 Pattern Querying
5.3.1 Preprocess-once Query-many Paradigm
5.3.1.1 Leveraging the Itemset Lattice
5.3.1.2 Leveraging Data Structures for Querying
5.3.2 Pushing Constraints into Pattern Mining
5.4 Putting Associations to Work: Applications
5.4.1 Relationship to Other Data Mining Problems
5.4.1.1 Application to Classification
5.4.1.2 Application to Clustering
5.4.1.3 Applications to Outlier Detection
5.4.2 Market Basket Analysis
5.4.3 Demographic and Profile Analysis
5.4.4 Recommendations and Collaborative Filtering
5.4.5 Web Log Analysis
5.4.6 Bioinformatics
5.4.7 Other Applications for Complex Data Types
5.5 Summary
5.6 Bibliographic Notes
5.7 Exercises
6 Cluster Analysis
6.1 Introduction
6.2 Feature Selection for Clustering
6.2.1 Filter Models
6.2.1.1 Term Strength
6.2.1.2 Predictive Attribute Dependence
6.2.1.3 Entropy
6.2.1.4 Hopkins Statistic
6.2.2 Wrapper Models
6.3 Representative-Based Algorithms
6.3.1 The k-Means Algorithm
6.3.2 The Kernel k-Means Algorithm
6.3.3 The k-Medians Algorithm
6.3.4 The k-Medoids Algorithm
6.4 Hierarchical Clustering Algorithms
6.4.1 Bottom-Up Agglomerative Methods
6.4.1.1 Group-Based Statistics
6.4.2 Top-Down Divisive Methods
6.4.2.1 Bisecting k-Means
6.5 Probabilistic Model-Based Algorithms
6.5.1 Relationship of EM to k-means and Other Representative Methods
6.6 Grid-Based and Density-Based Algorithms
6.6.1 Grid-Based Methods
6.6.2 DBSCAN
6.6.3 DENCLUE
6.7 Graph-Based Algorithms
6.7.1 Properties of Graph-Based Algorithms
6.8 Non-negative Matrix Factorization
6.8.1 Comparison with Singular Value Decomposition
6.9 Cluster Validation
6.9.1 Internal Validation Criteria
6.9.1.1 Parameter Tuning with Internal Measures
6.9.2 External Validation Criteria
6.9.3 General Comments
6.10 Summary
6.11 Bibliographic Notes
6.12 Exercises
7 Cluster Analysis: Advanced Concepts
7.1 Introduction
7.2 Clustering Categorical Data
7.2.1 Representative-Based Algorithms
7.2.1.1 k-Modes Clustering
7.2.1.2 k-Medoids Clustering
7.2.2 Hierarchical Algorithms
7.2.2.1 ROCK
7.2.3 Probabilistic Algorithms
7.2.4 Graph-Based Algorithms
7.3 Scalable Data Clustering
7.3.1 CLARANS
7.3.2 BIRCH
7.3.3 CURE
7.4 High-Dimensional Clustering
7.4.1 CLIQUE
7.4.2 PROCLUS
7.4.3 ORCLUS
7.5 Semisupervised Clustering
7.5.1 Pointwise Supervision
7.5.2 Pairwise Supervision
7.6 Human and Visually Supervised Clustering
7.6.1 Modifications of Existing Clustering Algorithms
7.6.2 Visual Clustering
7.7 Cluster Ensembles
7.7.1 Selecting Different Ensemble Components
7.7.2 Combining Different Ensemble Components
7.7.2.1 Hypergraph Partitioning Algorithm
7.7.2.2 Meta-clustering Algorithm
7.8 Putting Clustering to Work: Applications
7.8.1 Applications to Other Data Mining Problems
7.8.1.1 Data Summarization
7.8.1.2 Outlier Analysis
7.8.1.3 Classification
7.8.1.4 Dimensionality Reduction
7.8.1.5 Similarity Search and Indexing
7.8.2 Customer Segmentation and Collaborative Filtering
7.8.3 Text Applications
7.8.4 Multimedia Applications
7.8.5 Temporal and Sequence Applications
7.8.6 Social Network Analysis
7.9 Summary
7.10 Bibliographic Notes
7.11 Exercises
8 Outlier Analysis
8.1 Introduction
8.2 Extreme Value Analysis
8.2.1 Univariate Extreme Value Analysis
8.2.2 Multivariate Extreme Values
8.2.3 Depth-Based Methods
8.3 Probabilistic Models
8.4 Clustering for Outlier Detection
8.5 Distance-Based Outlier Detection
8.5.1 Pruning Methods
8.5.1.1 Sampling Methods
8.5.1.2 Early Termination Trick with Nested Loops
8.5.2 Local Distance Correction Methods
8.5.2.1 Local Outlier Factor (LOF)
8.5.2.2 Instance-Specific Mahalanobis Distance
8.6 Density-Based Methods
8.6.1 Histogram- and Grid-Based Techniques
8.6.2 Kernel Density Estimation
8.7 Information-Theoretic Models
8.8 Outlier Validity
8.8.1 Methodological Challenges
8.8.2 Receiver Operating Characteristic
8.8.3 Common Mistakes
8.9 Summary
8.10 Bibliographic Notes
8.11 Exercises
9 Outlier Analysis: Advanced Concepts
9.1 Introduction
9.2 Outlier Detection with Categorical Data
9.2.1 Probabilistic Models
9.2.2 Clustering and Distance-Based Methods
9.2.3 Binary and Set-Valued Data
9.3 High-Dimensional Outlier Detection
9.3.1 Grid-Based Rare Subspace Exploration
9.3.1.1 Modeling Abnormal Lower Dimensional Projections
9.3.1.2 Grid Search for Subspace Outliers
9.3.2 Random Subspace Sampling
9.4 Outlier Ensembles
9.4.1 Categorization by Component Independence
9.4.1.1 Sequential Ensembles
9.4.1.2 Independent Ensembles
9.4.2 Categorization by Constituent Components
9.4.2.1 Model-Centered Ensembles
9.4.2.2 Data-Centered Ensembles
9.4.3 Normalization and Combination
9.5 Putting Outliers to Work: Applications
9.5.1 Quality Control and Fault Detection
9.5.2 Financial Fraud and Anomalous Events
9.5.3 Web Log Analytics
9.5.4 Intrusion Detection Applications
9.5.5 Biological and Medical Applications
9.5.6 Earth Science Applications
9.6 Summary
9.7 Bibliographic Notes
9.8 Exercises
10 Data Classification
10.1 Introduction
10.2 Feature Selection for Classification
10.2.1 Filter Models
10.2.1.1 Gini Index
10.2.1.2 Entropy
10.2.1.3 Fisher Score
10.2.1.4 Fisher's Linear Discriminant
10.2.2 Wrapper Models
10.2.3 Embedded Models
10.3 Decision Trees
10.3.1 Split Criteria
10.3.2 Stopping Criterion and Pruning
10.3.3 Practical Issues
10.4 Rule-Based Classifiers
10.4.1 Rule Generation from Decision Trees
10.4.2 Sequential Covering Algorithms
10.4.2.1 Learn-One-Rule
10.4.3 Rule Pruning
10.4.4 Associative Classifiers
10.5 Probabilistic Classifiers
10.5.1 Naive Bayes Classifier
10.5.1.1 The Ranking Model for Classification
10.5.1.2 Discussion of the Naive Assumption
10.5.2 Logistic Regression
10.5.2.1 Training a Logistic Regression Classifier
10.5.2.2 Relationship with Other Linear Models
10.6 Support Vector Machines
10.6.1 Support Vector Machines for Linearly Separable Data
10.6.1.1 Solving the Lagrangian Dual
10.6.2 Support Vector Machines with Soft Margin for Nonseparable Data
10.6.2.1 Comparison with Other Linear Models
10.6.3 Nonlinear Support Vector Machines
10.6.4 The Kernel Trick
10.6.4.1 Other Applications of Kernel Methods
10.7 Neural Networks
10.7.1 Single-Layer Neural Network: The Perceptron
10.7.2 Multilayer Neural Networks
10.7.3 Comparing Various Linear Models
10.8 Instance-Based Learning
10.8.1 Design Variations of Nearest Neighbor Classifiers
10.8.1.1 Unsupervised Mahalanobis Metric
10.8.1.2 Nearest Neighbors with Linear Discriminant Analysis
10.9 Classifier Evaluation
10.9.1 Methodological Issues
10.9.1.1 Holdout
10.9.1.2 Cross-Validation
10.9.1.3 Bootstrap
10.9.2 Quantification Issues
10.9.2.1 Output as Class Labels
10.9.2.2 Output as Numerical Score
10.10 Summary
10.11 Bibliographic Notes
10.12 Exercises
11 Data Classification: Advanced Concepts
11.1 Introduction
11.2 Multiclass Learning
11.3 Rare Class Learning
11.3.1 Example Reweighting
11.3.2 Sampling Methods
11.3.2.1 Relationship Between Weighting and Sampling
11.3.2.2 Synthetic Oversampling: SMOTE
11.4 Scalable Classification
11.4.1 Scalable Decision Trees
11.4.1.1 RainForest
11.4.1.2 BOAT
11.4.2 Scalable Support Vector Machines
11.5 Regression Modeling with Numeric Classes
11.5.1 Linear Regression
11.5.1.1 Relationship with Fisher's Linear Discriminant
11.5.2 Principal Component Regression
11.5.3 Generalized Linear Models
11.5.4 Nonlinear and Polynomial Regression
11.5.5 From Decision Trees to Regression Trees
11.5.6 Assessing Model Effectiveness
11.6 Semisupervised Learning
11.6.1 Generic Meta-algorithms
11.6.1.1 Self-Training
11.6.1.2 Co-training
11.6.2 Specific Variations of Classification Algorithms
11.6.2.1 Semisupervised Bayes Classification with EM
11.6.2.2 Transductive Support Vector Machines
11.6.3 Graph-Based Semisupervised Learning
11.6.4 Discussion of Semisupervised Learning
11.7 Active Learning
11.7.1 Heterogeneity-Based Models
11.7.1.1 Uncertainty Sampling
11.7.1.2 Query-by-Committee
11.7.1.3 Expected Model Change
11.7.2 Performance-Based Models
11.7.2.1 Expected Error Reduction
11.7.2.2 Expected Variance Reduction
11.7.3 Representativeness-Based Models
11.8 Ensemble Methods
11.8.1 Why Does Ensemble Analysis Work?
11.8.2 Formal Statement of Bias-Variance Trade-off
11.8.3 Specific Instantiations of Ensemble Learning
11.8.3.1 Bagging
11.8.3.2 Random Forests
11.8.3.3 Boosting
11.8.3.4 Bucket of Models
11.8.3.5 Stacking
11.9 Summary
11.10 Bibliographic Notes
11.11 Exercises
12 Mining Data Streams
12.1 Introduction
12.2 Synopsis Data Structures for Streams
12.2.1 Reservoir Sampling
12.2.1.1 Handling Concept Drift
12.2.1.2 Useful Theoretical Bounds for Sampling
12.2.2 Synopsis Structures for the Massive-Domain Scenario
12.2.2.1 Bloom Filter
12.2.2.2 Count-Min Sketch
12.2.2.3 AMS Sketch
12.2.2.4 Flajolet–Martin Algorithm for Distinct Element Counting
12.3 Frequent Pattern Mining in Data Streams
12.3.1 Leveraging Synopsis Structures
12.3.1.1 Reservoir Sampling
12.3.1.2 Sketches
12.3.2 Lossy Counting Algorithm
12.4 Clustering Data Streams
12.4.1 STREAM Algorithm
12.4.2 CluStream Algorithm
12.4.2.1 Microcluster Definition
12.4.2.2 Microclustering Algorithm
12.4.2.3 Pyramidal Time Frame
12.4.3 Massive-Domain Stream Clustering
12.5 Streaming Outlier Detection
12.5.1 Individual Data Points as Outliers
12.5.2 Aggregate Change Points as Outliers
12.6 Streaming Classification
12.6.1 VFDT Family
12.6.2 Supervised Microcluster Approach
12.6.3 Ensemble Method
12.6.4 Massive-Domain Streaming Classification
12.7 Summary
12.8 Bibliographic Notes
12.9 Exercises
13 Mining Text Data
13.1 Introduction
13.2 Document Preparation and Similarity Computation
13.2.1 Document Normalization and Similarity Computation
13.2.2 Specialized Preprocessing for Web Documents
13.3 Specialized Clustering Methods for Text
13.3.1 Representative-Based Algorithms
13.3.1.1 Scatter/Gather Approach
13.3.2 Probabilistic Algorithms
13.3.3 Simultaneous Document and Word Cluster Discovery
13.3.3.1 Co-clustering
13.4 Topic Modeling
13.4.1 Use in Dimensionality Reduction and Comparison with Latent Semantic Analysis
13.4.2 Use in Clustering and Comparison with Probabilistic Clustering
13.4.3 Limitations of PLSA
13.5 Specialized Classification Methods for Text
13.5.1 Instance-Based Classifiers
13.5.1.1 Leveraging Latent Semantic Analysis
13.5.1.2 Centroid-Based Classification
13.5.1.3 Rocchio Classification
13.5.2 Bayes Classifiers
13.5.2.1 Multinomial Bayes Model
13.5.3 SVM Classifiers for High-Dimensional and Sparse Data
13.6 Novelty and First Story Detection
13.6.1 Micro-clustering Method
13.7 Summary
13.8 Bibliographic Notes
13.9 Exercises
14 Mining Time Series Data
14.1 Introduction
14.2 Time Series Preparation and Similarity
14.2.1 Handling Missing Values
14.2.2 Noise Removal
14.2.3 Normalization
14.2.4 Data Transformation and Reduction
14.2.4.1 Discrete Wavelet Transform
14.2.4.2 Discrete Fourier Transform
14.2.4.3 Symbolic Aggregate Approximation (SAX)
14.2.5 Time Series Similarity Measures
14.3 Time Series Forecasting
14.3.1 Autoregressive Models
14.3.2 Autoregressive Moving Average Models
14.3.3 Multivariate Forecasting with Hidden Variables
14.4 Time Series Motifs
14.4.1 Distance-Based Motifs
14.4.2 Transformation to Sequential Pattern Mining
14.4.3 Periodic Patterns
14.5 Time Series Clustering
14.5.1 Online Clustering of Coevolving Series
14.5.2 Shape-Based Clustering
14.5.2.1 k-Means
14.5.2.2 k-Medoids
14.5.2.3 Hierarchical Methods
14.5.2.4 Graph-Based Methods
14.6 Time Series Outlier Detection
14.6.1 Point Outliers
14.6.2 Shape Outliers
14.7 Time Series Classification
14.7.1 Supervised Event Detection
14.7.2 Whole Series Classification
14.7.2.1 Wavelet-Based Rules
14.7.2.2 Nearest Neighbor Classifier
14.7.2.3 Graph-Based Methods
14.8 Summary
14.9 Bibliographic Notes
14.10 Exercises
15 Mining Discrete Sequences
15.1 Introduction
15.2 Sequential Pattern Mining
15.2.1 Frequent Patterns to Frequent Sequences
15.2.2 Constrained Sequential Pattern Mining
15.3 Sequence Clustering
15.3.1 Distance-Based Methods
15.3.2 Graph-Based Methods
15.3.3 Subsequence-Based Clustering
15.3.4 Probabilistic Clustering
15.3.4.1 Markovian Similarity-Based Algorithm: CLUSEQ
15.3.4.2 Mixture of Hidden Markov Models
15.4 Outlier Detection in Sequences
15.4.1 Position Outliers
15.4.1.1 Efficiency Issues: Probabilistic Suffix Trees
15.4.2 Combination Outliers
15.4.2.1 Distance-Based Models
15.4.2.2 Frequency-Based Models
15.5 Hidden Markov Models
15.5.1 Formal Definition and Techniques for HMMs
15.5.2 Evaluation: Computing the Fit Probability for Observed Sequence
15.5.3 Explanation: Determining the Most Likely State Sequence for Observed Sequence
15.5.4 Training: Baum–Welch Algorithm
15.5.5 Applications
15.6 Sequence Classification
15.6.1 Nearest Neighbor Classifier
15.6.2 Graph-Based Methods
15.6.3 Rule-Based Methods
15.6.4 Kernel Support Vector Machines
15.6.4.1 Bag-of-Words Kernel
15.6.4.2 Spectrum Kernel
15.6.4.3 Weighted Degree Kernel
15.6.5 Probabilistic Methods: Hidden Markov Models
15.7 Summary
15.8 Bibliographic Notes
15.9 Exercises
16 Mining Spatial Data
16.1 Introduction
16.2 Mining with Contextual Spatial Attributes
16.2.1 Shape to Time Series Transformation
16.2.2 Spatial to Multidimensional Transformation with Wavelets
16.2.3 Spatial Colocation Patterns
16.2.4 Clustering Shapes
16.2.5 Outlier Detection
16.2.5.1 Point Outliers
16.2.5.2 Shape Outliers
16.2.6 Classification of Shapes
16.3 Trajectory Mining
16.3.1 Equivalence of Trajectories and Multivariate Time Series
16.3.2 Converting Trajectories to Multidimensional Data
16.3.3 Trajectory Pattern Mining
16.3.3.1 Frequent Trajectory Paths
16.3.3.2 Colocation Patterns
16.3.4 Trajectory Clustering
16.3.4.1 Computing Similarity Between Trajectories
16.3.4.2 Similarity-Based Clustering Methods
16.3.4.3 Trajectory Clustering as a Sequence Clustering Problem
16.3.5 Trajectory Outlier Detection
16.3.5.1 Distance-Based Methods
16.3.5.2 Sequence-Based Methods
16.3.6 Trajectory Classification
16.3.6.1 Distance-Based Methods
16.3.6.2 Sequence-Based Methods
16.4 Summary
16.5 Bibliographic Notes
16.6 Exercises
17 Mining Graph Data
17.1 Introduction
17.2 Matching and Distance Computation in Graphs
17.2.1 Ullman's Algorithm for Subgraph Isomorphism
17.2.1.1 Algorithm Variations and Refinements
17.2.2 Maximum Common Subgraph (MCG) Problem
17.2.3 Graph Matching Methods for Distance Computation
17.2.3.1 MCG-based Distances
17.2.3.2 Graph Edit Distance
17.3 Transformation-Based Distance Computation
17.3.1 Frequent Substructure-Based Transformation and Distance Computation
17.3.2 Topological Descriptors
17.3.3 Kernel-Based Transformations and Computation
17.3.3.1 Random Walk Kernels
17.3.3.2 Shortest-Path Kernels
17.4 Frequent Substructure Mining in Graphs
17.4.1 Node-Based Join Growth
17.4.2 Edge-Based Join Growth
17.4.3 Frequent Pattern Mining to Graph Pattern Mining
17.5 Graph Clustering
17.5.1 Distance-Based Methods
17.5.2 Frequent Substructure-Based Methods
17.5.2.1 Generic Transformational Approach
17.5.2.2 XProj: Direct Clustering with Frequent Subgraph Discovery
17.6 Graph Classification
17.6.1 Distance-Based Methods
17.6.2 Frequent Substructure-Based Methods
17.6.2.1 Generic Transformational Approach
17.6.2.2 XRules: A Rule-Based Approach
17.6.3 Kernel SVMs
17.7 Summary
17.8 Bibliographic Notes
17.9 Exercises
18 Mining Web Data
18.1 Introduction
18.2 Web Crawling and Resource Discovery
18.2.1 A Basic Crawler Algorithm
18.2.2 Preferential Crawlers
18.2.3 Multiple Threads
18.2.4 Combatting Spider Traps
18.2.5 Shingling for Near Duplicate Detection
18.3 Search Engine Indexing and Query Processing
18.4 Ranking Algorithms
18.4.1 PageRank
18.4.1.1 Topic-Sensitive PageRank
18.4.1.2 SimRank
18.4.2 HITS
18.5 Recommender Systems
18.5.1 Content-Based Recommendations
18.5.2 Neighborhood-Based Methods for Collaborative Filtering
18.5.2.1 User-Based Similarity with Ratings
18.5.2.2 Item-Based Similarity with Ratings
18.5.3 Graph-Based Methods
18.5.4 Clustering Methods
18.5.4.1 Adapting k-Means Clustering
18.5.4.2 Adapting Co-Clustering
18.5.5 Latent Factor Models
18.5.5.1 Singular Value Decomposition
18.5.5.2 Matrix Factorization
18.6 Web Usage Mining
18.6.1 Data Preprocessing
18.6.2 Applications
18.7 Summary
18.8 Bibliographic Notes
18.9 Exercises
19 Social Network Analysis
19.1 Introduction
19.2 Social Networks: Preliminaries and Properties
19.2.1 Homophily
19.2.2 Triadic Closure and Clustering Coefficient
19.2.3 Dynamics of Network Formation
19.2.4 Power-Law Degree Distributions
19.2.5 Measures of Centrality and Prestige
19.2.5.1 Degree Centrality and Prestige
19.2.5.2 Closeness Centrality and Proximity Prestige
19.2.5.3 Betweenness Centrality
19.2.5.4 Rank Centrality and Prestige
19.3 Community Detection
19.3.1 Kernighan–Lin Algorithm
19.3.1.1 Speeding Up Kernighan–Lin
19.3.2 Girvan–Newman Algorithm
19.3.3 Multilevel Graph Partitioning: METIS
19.3.4 Spectral Clustering
19.3.4.1 Important Observations and Intuitions
19.4 Collective Classification
19.4.1 Iterative Classification Algorithm
19.4.2 Label Propagation with Random Walks
19.4.2.1 Iterative Label Propagation: The Spectral Interpretation
19.4.3 Supervised Spectral Methods
19.4.3.1 Supervised Feature Generation with Spectral Embedding
19.4.3.2 Graph Regularization Approach
19.4.3.3 Connections with Random Walk Methods
19.5 Link Prediction
19.5.1 Neighborhood-Based Measures
19.5.2 Katz Measure
19.5.3 Random Walk-Based Measures
19.5.4 Link Prediction as a Classification Problem
19.5.5 Link Prediction as a Missing-Value Estimation Problem
19.5.6 Discussion
19.6 Social Influence Analysis
19.6.1 Linear Threshold Model
19.6.2 Independent Cascade Model
19.6.3 Influence Function Evaluation
19.7 Summary
19.8 Bibliographic Notes
19.9 Exercises
20 Privacy-Preserving Data Mining
20.1 Introduction
20.2 Privacy During Data Collection
20.2.1 Reconstructing Aggregate Distributions
20.2.2 Leveraging Aggregate Distributions for Data Mining
20.3 Privacy-Preserving Data Publishing
20.3.1 The k-Anonymity Model
20.3.1.1 Samarati's Algorithm
20.3.1.2 Incognito
20.3.1.3 Mondrian Multidimensional k-Anonymity
20.3.1.4 Synthetic Data Generation: Condensation-Based Approach
20.3.2 The ℓ-Diversity Model
20.3.3 The t-closeness Model
20.3.4 The Curse of Dimensionality
20.4 Output Privacy
20.5 Distributed Privacy
20.6 Summary
20.7 Bibliographic Notes
20.8 Exercises
Bibliography
Index