Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques, Fourth Edition introduces concepts, principles, and methods for mining patterns, knowledge, and models from various kinds of data for diverse applications. Specifically, it delves into the processes for uncovering patterns and knowledge from massive collections of data, known as knowledge discovery from data, or KDD. It focuses on the feasibility, usefulness, effectiveness, and scalability of data mining techniques for large data sets.

After an introduction to the concept of data mining, the authors explain methods for preprocessing, characterizing, and warehousing data. They then partition data mining methods into several major tasks, introducing concepts and methods for mining frequent patterns, associations, and correlations in large data sets; data classification and model construction; cluster analysis; and outlier detection. Concepts and methods for deep learning are systematically introduced in a dedicated chapter. Finally, the book covers the trends, applications, and research frontiers of data mining.

Author(s): Jiawei Han, Jian Pei, Hanghang Tong
Series: The Morgan Kaufmann Series in Data Management Systems
Edition: 4
Publisher: Morgan Kaufmann
Year: 2022

Language: English
Pages: 752
City: Cambridge, MA
Tags: Data Mining; Data Science; Analytical Processing; Pattern Mining; Classification; Cluster Analysis; Deep Learning

Front Cover
Data Mining
Copyright
Contents
Foreword
Foreword to second edition
Preface
Acknowledgments
About the authors
1 Introduction
1.1 What is data mining?
1.2 Data mining: an essential step in knowledge discovery
1.3 Diversity of data types for data mining
1.4 Mining various kinds of knowledge
1.4.1 Multidimensional data summarization
1.4.2 Mining frequent patterns, associations, and correlations
1.4.3 Classification and regression for predictive analysis
1.4.4 Cluster analysis
1.4.5 Deep learning
1.4.6 Outlier analysis
1.4.7 Are all mining results interesting?
1.5 Data mining: confluence of multiple disciplines
1.5.1 Statistics and data mining
1.5.2 Machine learning and data mining
1.5.3 Database technology and data mining
1.5.4 Data mining and data science
1.5.5 Data mining and other disciplines
1.6 Data mining and applications
1.7 Data mining and society
1.8 Summary
1.9 Exercises
1.10 Bibliographic notes
2 Data, measurements, and data preprocessing
2.1 Data types
2.1.1 Nominal attributes
2.1.2 Binary attributes
2.1.3 Ordinal attributes
2.1.4 Numeric attributes
Interval-scaled attributes
Ratio-scaled attributes
2.1.5 Discrete vs. continuous attributes
2.2 Statistics of data
2.2.1 Measuring the central tendency
2.2.2 Measuring the dispersion of data
Range, quartiles, and interquartile range
Five-number summary, boxplots, and outliers
Variance and standard deviation
2.2.3 Covariance and correlation analysis
Covariance of numeric data
Correlation coefficient for numeric data
χ2 correlation test for nominal data
2.2.4 Graphic displays of basic statistics of data
Quantile plot
Quantile-quantile plot
Histograms
Scatter plots and data correlation
2.3 Similarity and distance measures
2.3.1 Data matrix vs. dissimilarity matrix
2.3.2 Proximity measures for nominal attributes
2.3.3 Proximity measures for binary attributes
2.3.4 Dissimilarity of numeric data: Minkowski distance
2.3.5 Proximity measures for ordinal attributes
2.3.6 Dissimilarity for attributes of mixed types
2.3.7 Cosine similarity
2.3.8 Measuring similar distributions: the Kullback-Leibler divergence
2.3.9 Capturing hidden semantics in similarity measures
2.4 Data quality, data cleaning, and data integration
2.4.1 Data quality measures
2.4.2 Data cleaning
Missing values
Noisy data
Data cleaning as a process
2.4.3 Data integration
Entity identification problem
Redundancy and correlation analysis
Tuple duplication
Data value conflict detection and resolution
2.5 Data transformation
2.5.1 Normalization
2.5.2 Discretization
Discretization by binning
Discretization by histogram analysis
2.5.3 Data compression
2.5.4 Sampling
2.6 Dimensionality reduction
2.6.1 Principal components analysis
2.6.2 Attribute subset selection
2.6.3 Nonlinear dimensionality reduction methods
General procedure
Kernel PCA
Stochastic neighbor embedding
2.7 Summary
2.8 Exercises
2.9 Bibliographic notes
3 Data warehousing and online analytical processing
3.1 Data warehouse
3.1.1 Data warehouse: what and why?
3.1.2 Architecture of data warehouses: enterprise data warehouses and data marts
The three-tier architecture
ETL for data warehouses
Enterprise data warehouse and data mart
3.1.3 Data lakes
3.2 Data warehouse modeling: schema and measures
3.2.1 Data cube: a multidimensional data model
3.2.2 Schemas for multidimensional data models: stars, snowflakes, and fact constellations
3.2.3 Concept hierarchies
3.2.4 Measures: categorization and computation
3.3 OLAP operations
3.3.1 Typical OLAP operations
3.3.2 Indexing OLAP data: bitmap index and join index
Bitmap indexing
Join indexing
3.3.3 Storage implementation: column-based databases
3.4 Data cube computation
3.4.1 Terminology of data cube computation
3.4.2 Data cube materialization: ideas
3.4.3 OLAP server architectures: ROLAP vs. MOLAP vs. HOLAP
3.4.4 General strategies for data cube computation
3.5 Data cube computation methods
3.5.1 Multiway array aggregation for full cube computation
3.5.2 BUC: computing iceberg cubes from the apex cuboid downward
3.5.3 Precomputing shell fragments for fast high-dimensional OLAP
3.5.4 Efficient processing of OLAP queries using cuboids
3.6 Summary
3.7 Exercises
3.8 Bibliographic notes
4 Pattern mining: basic concepts and methods
4.1 Basic concepts
4.1.1 Market basket analysis: a motivating example
4.1.2 Frequent itemsets, closed itemsets, and association rules
4.2 Frequent itemset mining methods
4.2.1 Apriori algorithm: finding frequent itemsets by confined candidate generation
4.2.2 Generating association rules from frequent itemsets
4.2.3 Improving the efficiency of Apriori
4.2.4 A pattern-growth approach for mining frequent itemsets
4.2.5 Mining frequent itemsets using the vertical data format
4.2.6 Mining closed and max patterns
4.3 Which patterns are interesting?—Pattern evaluation methods
4.3.1 Strong rules are not necessarily interesting
4.3.2 From association analysis to correlation analysis
4.3.3 A comparison of pattern evaluation measures
4.4 Summary
4.5 Exercises
4.6 Bibliographic notes
5 Pattern mining: advanced methods
5.1 Mining various kinds of patterns
5.1.1 Mining multilevel associations
5.1.2 Mining multidimensional associations
5.1.3 Mining quantitative association rules
Data cube–based mining of quantitative associations
Mining clustering-based quantitative associations
Using statistical theory to disclose exceptional behavior
5.1.4 Mining high-dimensional data
5.1.5 Mining rare patterns and negative patterns
5.2 Mining compressed or approximate patterns
5.2.1 Mining compressed patterns by pattern clustering
5.2.2 Extracting redundancy-aware top-k patterns
5.3 Constraint-based pattern mining
5.3.1 Pruning pattern space with pattern pruning constraints
Pattern antimonotonicity
Pattern monotonicity
Convertible constraints: ordering data in transactions
5.3.2 Pruning data space with data pruning constraints
5.3.3 Mining space pruning with succinctness constraints
5.4 Mining sequential patterns
5.4.1 Sequential pattern mining: concepts and primitives
5.4.2 Scalable methods for mining sequential patterns
GSP: a sequential pattern mining algorithm based on candidate generate-and-test
SPADE: an Apriori-based vertical data format sequential pattern mining algorithm
PrefixSpan: prefix-projected sequential pattern growth
Mining closed sequential patterns
Mining multidimensional, multilevel sequential patterns
5.4.3 Constraint-based mining of sequential patterns
5.5 Mining subgraph patterns
5.5.1 Methods for mining frequent subgraphs
Apriori-based approach
Pattern-growth approach
5.5.2 Mining variant and constrained substructure patterns
Mining closed frequent substructures
Extension of pattern-growth approach: mining alternative substructure patterns
Mining substructure patterns with user-specified constraints
Mining approximate frequent substructures
Mining coherent substructures
5.6 Pattern mining: application examples
5.6.1 Phrase mining in massive text data
How to judge the quality of a phrase?
Phrasal segmentation and computing phrase quality
Phrase mining methods
5.6.2 Mining copy and paste bugs in software programs
5.7 Summary
5.8 Exercises
5.9 Bibliographic notes
6 Classification: basic concepts and methods
6.1 Basic concepts
6.1.1 What is classification?
6.1.2 General approach to classification
6.2 Decision tree induction
6.2.1 Decision tree induction
6.2.2 Attribute selection measures
Information gain
Gain ratio
Gini impurity
Other attribute selection measures
6.2.3 Tree pruning
6.3 Bayes classification methods
6.3.1 Bayes' theorem
6.3.2 Naïve Bayesian classification
6.4 Lazy learners (or learning from your neighbors)
6.4.1 k-nearest-neighbor classifiers
6.4.2 Case-based reasoning
6.5 Linear classifiers
6.5.1 Linear regression
6.5.2 Perceptron: turning linear regression to classification
6.5.3 Logistic regression
6.6 Model evaluation and selection
6.6.1 Metrics for evaluating classifier performance
6.6.2 Holdout method and random subsampling
6.6.3 Cross-validation
6.6.4 Bootstrap
6.6.5 Model selection using statistical tests of significance
6.6.6 Comparing classifiers based on cost–benefit and ROC curves
6.7 Techniques to improve classification accuracy
6.7.1 Introducing ensemble methods
6.7.2 Bagging
6.7.3 Boosting
6.7.4 Random forests
6.7.5 Improving classification accuracy of class-imbalanced data
6.8 Summary
6.9 Exercises
6.10 Bibliographic notes
7 Classification: advanced methods
7.1 Feature selection and engineering
7.1.1 Filter methods
7.1.2 Wrapper methods
7.1.3 Embedded methods
7.2 Bayesian belief networks
7.2.1 Concepts and mechanisms
7.2.2 Training Bayesian belief networks
7.3 Support vector machines
7.3.1 Linear support vector machines
7.3.2 Nonlinear support vector machines
7.4 Rule-based and pattern-based classification
7.4.1 Using IF-THEN rules for classification
7.4.2 Rule extraction from a decision tree
7.4.3 Rule induction using a sequential covering algorithm
Rule quality measures
Rule pruning
7.4.4 Associative classification
7.4.5 Discriminative frequent pattern–based classification
7.5 Classification with weak supervision
7.5.1 Semisupervised classification
7.5.2 Active learning
7.5.3 Transfer learning
7.5.4 Distant supervision
7.5.5 Zero-shot learning
7.6 Classification with rich data types
7.6.1 Stream data classification
7.6.2 Sequence classification
7.6.3 Graph data classification
7.7 Potpourri: other related techniques
7.7.1 Multiclass classification
7.7.2 Distance metric learning
7.7.3 Interpretability of classification
7.7.4 Genetic algorithms
7.7.5 Reinforcement learning
7.8 Summary
7.9 Exercises
7.10 Bibliographic notes
8 Cluster analysis: basic concepts and methods
8.1 Cluster analysis
8.1.1 What is cluster analysis?
8.1.2 Requirements for cluster analysis
8.1.3 Overview of basic clustering methods
8.2 Partitioning methods
8.2.1 k-Means: a centroid-based technique
8.2.2 Variations of k-means
k-Medoids: a representative object-based technique
k-Modes: clustering nominal data
Initialization in partitioning methods
Estimating the number of clusters
Applying feature transformation
8.3 Hierarchical methods
8.3.1 Basic concepts of hierarchical clustering
8.3.2 Agglomerative hierarchical clustering
Similarity measures in hierarchical clustering
Connecting agglomerative hierarchical clustering and partitioning methods
The Lance-Williams algorithm
8.3.3 Divisive hierarchical clustering
The minimum spanning tree–based approach
Dendrogram
8.3.4 BIRCH: scalable hierarchical clustering using clustering feature trees
8.3.5 Probabilistic hierarchical clustering
8.4 Density-based and grid-based methods
8.4.1 DBSCAN: density-based clustering based on connected regions with high density
8.4.2 DENCLUE: clustering based on density distribution functions
8.4.3 Grid-based methods
8.5 Evaluation of clustering
8.5.1 Assessing clustering tendency
8.5.2 Determining the number of clusters
8.5.3 Measuring clustering quality: extrinsic methods
Extrinsic vs. intrinsic methods
Desiderata of extrinsic methods
Categories of extrinsic methods
Matching-based methods
Information theory–based methods
Pairwise comparison–based methods
8.5.4 Intrinsic methods
8.6 Summary
8.7 Exercises
8.8 Bibliographic notes
9 Cluster analysis: advanced methods
9.1 Probabilistic model-based clustering
9.1.1 Fuzzy clusters
9.1.2 Probabilistic model-based clusters
9.1.3 Expectation-maximization algorithm
9.2 Clustering high-dimensional data
9.2.1 Why is clustering high-dimensional data challenging?
Motivations of clustering analysis on high-dimensional data
High-dimensional clustering models
Categorization of high-dimensional clustering methods
9.2.2 Axis-parallel subspace approaches
CLIQUE: a subspace clustering method
PROCLUS: a projected clustering method
Soft projected clustering methods
9.2.3 Arbitrarily oriented subspace approaches
9.3 Biclustering
9.3.1 Why and where is biclustering useful?
9.3.2 Types of biclusters
9.3.3 Biclustering methods
Optimization using the δ-cluster algorithm
9.3.4 Enumerating all biclusters using MaPle
9.4 Dimensionality reduction for clustering
9.4.1 Linear dimensionality reduction methods for clustering
9.4.2 Nonnegative matrix factorization (NMF)
9.4.3 Spectral clustering
Similarity graph
Finding a new space
Extracting clusters
9.5 Clustering graph and network data
9.5.1 Applications and challenges
9.5.2 Similarity measures
Geodesic distance
SimRank: similarity based on random walk and structural context
Personalized PageRank and topical PageRank
9.5.3 Graph clustering methods
Generic high-dimensional clustering methods on graphs
Specific clustering methods by searching graph structures
Probabilistic graphical model-based methods
9.6 Semisupervised clustering
9.6.1 Semisupervised clustering on partially labeled data
9.6.2 Semisupervised clustering on pairwise constraints
9.6.3 Other types of background knowledge for semisupervised clustering
Semisupervised hierarchical clustering
Clusters associated with outcome variables
Active and interactive learning for semisupervised clustering
9.7 Summary
9.8 Exercises
9.9 Bibliographic notes
10 Deep learning
10.1 Basic concepts
10.1.1 What is deep learning?
10.1.2 Backpropagation algorithm
10.1.3 Key challenges for training deep learning models
10.1.4 Overview of deep learning architecture
10.2 Improve training of deep learning models
10.2.1 Responsive activation functions
10.2.2 Adaptive learning rate
10.2.3 Dropout
10.2.4 Pretraining
10.2.5 Cross-entropy
10.2.6 Autoencoder: unsupervised deep learning
10.2.7 Other techniques
10.3 Convolutional neural networks
10.3.1 Introducing convolution operation
10.3.2 Multidimensional convolution
10.3.3 Convolutional layer
10.4 Recurrent neural networks
10.4.1 Basic RNN models and applications
10.4.2 Gated RNNs
10.4.3 Other techniques for addressing long-term dependence
10.5 Graph neural networks
10.5.1 Basic concepts
10.5.2 Graph convolutional networks
10.5.3 Other types of GNNs
10.6 Summary
10.7 Exercises
10.8 Bibliographic notes
11 Outlier detection
11.1 Basic concepts
11.1.1 What are outliers?
11.1.2 Types of outliers
Global outliers
Contextual outliers
Collective outliers
11.1.3 Challenges of outlier detection
11.1.4 An overview of outlier detection methods
Supervised, semisupervised, and unsupervised methods
Statistical methods, proximity-based methods, and reconstruction-based methods
11.2 Statistical approaches
11.2.1 Parametric methods
Detection of univariate outliers based on normal distribution
Detection of multivariate outliers
Using a mixture of parametric distributions
11.2.2 Nonparametric methods
11.3 Proximity-based approaches
11.3.1 Distance-based outlier detection
11.3.2 Density-based outlier detection
11.4 Reconstruction-based approaches
11.4.1 Matrix factorization–based methods for numerical data
11.4.2 Pattern-based compression methods for categorical data
11.5 Clustering- vs. classification-based approaches
11.5.1 Clustering-based approaches
11.5.2 Classification-based approaches
11.6 Mining contextual and collective outliers
11.6.1 Transforming contextual outlier detection to conventional outlier detection
11.6.2 Modeling normal behavior with respect to contexts
11.6.3 Mining collective outliers
11.7 Outlier detection in high-dimensional data
11.7.1 Extending conventional outlier detection
11.7.2 Finding outliers in subspaces
11.7.3 Outlier detection ensemble
11.7.4 Taming high dimensionality by deep learning
11.7.5 Modeling high-dimensional outliers
11.8 Summary
11.9 Exercises
11.10 Bibliographic notes
12 Data mining trends and research frontiers
12.1 Mining rich data types
12.1.1 Mining text data
Bibliographic notes
12.1.2 Spatial-temporal data
Auto-correlation and heterogeneity in spatial and temporal data
Spatial and temporal data types
Spatial and temporal data models
Bibliographic notes
12.1.3 Graph and networks
Bibliographic notes
12.2 Data mining applications
12.2.1 Data mining for sentiment and opinion
What are sentiments and opinions?
Sentiment analysis and opinion mining techniques
Sentiment analysis and opinion mining applications
Bibliographic notes
12.2.2 Truth discovery and misinformation identification
Truth discovery
Identification of misinformation
Bibliographic notes
12.2.3 Information and disease propagation
Bibliographic notes
12.2.4 Productivity and team science
Bibliographic notes
12.3 Data mining methodologies and systems
12.3.1 Structuring unstructured data for knowledge mining: a data-driven approach
Bibliographic notes
12.3.2 Data augmentation
Bibliographic notes
12.3.3 From correlation to causality
Bibliographic notes
12.3.4 Network as a context
Bibliographic notes
12.3.5 Auto-ML: methods and systems
Bibliographic notes
12.4 Data mining, people, and society
12.4.1 Privacy-preserving data mining
12.4.2 Human-algorithm interaction
Bibliographic notes
12.4.3 Mining beyond maximizing accuracy: fairness, interpretability, and robustness
Bibliographic notes
12.4.4 Data mining for social good
Bibliographic notes
A Mathematical background
A.1 Probability and statistics
A.1.1 PDF of typical distributions
A.1.2 MLE and MAP
A.1.3 Significance test
A.1.4 Density estimation
A.1.5 Bias-variance tradeoff
A.1.6 Cross-validation and Jackknife
Cross-validation
Jackknife
A.2 Numerical optimization
A.2.1 Gradient descent
A.2.2 Variants of gradient descent
A.2.3 Newton's method
A.2.4 Coordinate descent
A.2.5 Quadratic programming
A.3 Matrix and linear algebra
A.3.1 Linear system Ax=b
Standard square system
Overdetermined system
Underdetermined system
A.3.2 Norms of vectors and matrices
Norms of vectors
Norms of matrices
A.3.3 Matrix decompositions
Eigenvalues and eigendecomposition
Singular value decomposition (SVD)
A.3.4 Subspace
A.3.5 Orthogonality
A.4 Concepts and tools from signal processing
A.4.1 Entropy
A.4.2 Kullback-Leibler divergence (KL-divergence)
A.4.3 Mutual information
A.4.4 Discrete Fourier transform (DFT) and fast Fourier transform (FFT)
A.5 Bibliographic notes
Bibliography
Index
Back Cover