Text Mining: Classification, Clustering, and Applications

The Definitive Resource on Text Mining Theory and Applications from Foremost Researchers in the Field

Offering a broad perspective on the field from numerous vantage points, Text Mining: Classification, Clustering, and Applications focuses on statistical methods for text mining and analysis. It examines methods for automatically clustering and classifying text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search.

The book begins with chapters on the classification of documents into predefined categories. It presents state-of-the-art algorithms and their use in practice. The next chapters describe novel methods for clustering documents into groups that are not predefined. These methods seek to automatically determine topical structures that may exist in a document corpus. The book concludes by discussing various text mining applications that have significant implications for future research and industrial use.

There is no doubt that text mining will continue to play a critical role in the development of future information systems, and advances in research will be instrumental to their success. This book captures the technical depth and immense practical potential of text mining, guiding readers to a sound appreciation of this burgeoning field.

Editors: Ashok Srivastava, Mehran Sahami
Series: Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Edition: 1
Publisher: Chapman & Hall/CRC
Year: 2009

Language: English
Pages: 308

Cover Page......Page 1
Title Page......Page 2
Text Mining: Classification, Clustering, and Applications......Page 4
Contents......Page 7
List of Figures......Page 13
List of Tables......Page 18
Introduction......Page 20
About the Editors......Page 25
Contributor List......Page 26
1.2 General Overview on Kernel Methods......Page 28
1.2.1 Finding Patterns in Feature Space......Page 32
1.2.2 Formal Properties of Kernel Functions......Page 35
1.2.3 Operations on Kernel Functions......Page 37
1.3.1 Vector Space Model......Page 38
1.3.2 Semantic Kernels......Page 40
1.3.2.1 Designing the Proximity Matrix......Page 42
1.3.3 String Kernels......Page 44
1.4 Example......Page 46
References......Page 49
2.1 Introduction......Page 53
2.2 Overview of the Experiments......Page 55
2.3 Data Collection and Preparation......Page 56
2.3.2 Data Preparation......Page 57
2.3.3 Detection of Matching News Items......Page 58
2.4 News Outlet Identification......Page 61
2.5 Topic-Wise Comparison of Term Bias......Page 64
2.6 News Outlets Map......Page 66
2.6.1 Distance Based on Lexical Choices......Page 68
2.6.2 Distance Based on Choice of Topics......Page 69
2.7 Related Work......Page 70
2.8 Conclusion......Page 71
References......Page 72
Acknowledgments......Page 73
2.10 Appendix B: Bag of Words and Vector Space Models......Page 74
2.11 Appendix C: Kernel Canonical Correlation Analysis......Page 75
2.12 Appendix D: Multidimensional Scaling......Page 76
3.1 Introduction......Page 77
3.3 Approximate Inference Algorithms for Approaches Based on Local Conditional Classifiers......Page 79
3.3.1 Iterative Classification......Page 80
3.3.3 Local Classifiers and Further Optimizations......Page 81
3.4 Approximate Inference Algorithms for Approaches Based on Global Formulations......Page 82
3.4.1 Loopy Belief Propagation......Page 84
3.4.2 Relaxation Labeling via Mean-Field Approach......Page 85
3.6.2 Real-World Datasets......Page 86
3.6.2.1 Results......Page 87
3.6.3 Practical Issues......Page 89
3.7 Related Work......Page 90
References......Page 92
4.1 Introduction......Page 96
4.2 Latent Dirichlet Allocation......Page 97
4.2.1 Statistical Assumptions......Page 98
4.2.2 Exploring a Corpus with the Posterior Distribution......Page 100
4.3 Posterior Inference for LDA......Page 101
4.3.1 Mean Field Variational Inference......Page 103
4.3.2 Practical Considerations......Page 106
4.4.1 The Correlated Topic Model......Page 107
4.4.2 The Dynamic Topic Model......Page 109
4.5 Discussion......Page 114
References......Page 115
5.1 Introduction......Page 119
5.1.2 Related Work......Page 120
5.2 Notation......Page 121
5.3 Tensor Decompositions and Algorithms......Page 122
5.3.2 Nonnegative Tensor Factorization......Page 124
5.4 Enron Subset......Page 126
5.4.1 Term Weighting Techniques......Page 127
5.5.1 Nonnegative Tensor Decomposition......Page 129
5.5.2 Analysis of Three-Way Tensor......Page 130
5.5.3 Analysis of Four-Way Tensor......Page 132
5.6 Visualizing Results of the NMF Clustering......Page 135
5.7 Future Work......Page 140
References......Page 141
6.1 Introduction......Page 145
6.2 Related Work......Page 147
6.3.1 The von Mises-Fisher (vMF) Distribution......Page 148
6.3.2 Maximum Likelihood Estimates......Page 149
6.4 EM on a Mixture of vMFs (moVMF)......Page 150
6.5 Handling High-Dimensional Text Datasets......Page 151
6.5.1 Approximating k......Page 152
6.5.2 Experimental Study of the Approximation......Page 154
6.6 Algorithms......Page 164
6.7 Experimental Results......Page 166
6.7.1 Datasets......Page 167
6.7.3 Simulated Datasets......Page 170
6.7.4 Classic3 Family of Datasets......Page 172
6.7.6 20 Newsgroup Family of Datasets......Page 175
6.7.7 Slashdot Datasets......Page 177
6.8 Discussion......Page 178
6.9 Conclusions and Future Work......Page 180
Acknowledgments......Page 181
References......Page 182
7.1 Introduction......Page 186
7.2.1 Constraint-Based Methods......Page 188
7.2.2 Distance-Based Methods......Page 189
7.3 Text Clustering......Page 190
7.3.1 Pre-Processing......Page 192
7.3.2 Distance Measures......Page 193
7.4.1 COP-KMeans......Page 194
7.4.2 Algorithms with Penalties – PKM, CVQE......Page 195
7.4.2.1 CVQE......Page 196
7.4.4 Probabilistic Penalty – PKM......Page 198
7.5.1 Generalized Mahalanobis Distance Learning......Page 199
7.5.2 Kernel Distance Functions Using AdaBoost......Page 200
7.6.1 Hidden Markov Random Field (HMRF) Model......Page 201
7.6.3 Improvements to HMRF-KMeans......Page 204
7.7.1 Datasets......Page 205
7.7.2 Clustering Evaluation......Page 206
7.7.4 Comparison of Distance Functions......Page 207
7.7.5 Experimental Results......Page 208
References......Page 211
8.1 Introduction......Page 216
8.2 Standard Evaluation Measures......Page 219
8.3.1 Existing Retrieval Models......Page 221
8.3.1.3 Probabilistic models......Page 222
8.3.2.1 Filtering as retrieval + thresholding......Page 223
8.3.2.2 Filtering as text classification......Page 224
8.4 Collaborative Adaptive Filtering......Page 225
8.5 Novelty and Redundancy Detection......Page 227
8.5.2 Geometric Distance......Page 230
8.5.3 Distributional Similarity......Page 231
8.6 Other Adaptive Filtering Topics......Page 232
8.6.2 Using Implicit Feedback......Page 233
8.6.4 Evaluation beyond Topical Relevance......Page 234
References......Page 235
Symbol Description......Page 242
9.1.1 Related Work in Adaptive Filtering (AF)......Page 243
9.1.2 Related Work in Topic Detection and Tracking (TDT)......Page 244
9.1.3 Limitations of Current Solutions......Page 245
9.2 A Sample Task......Page 246
9.3.1 Adaptive Filtering Component......Page 248
9.3.2 Passage Retrieval Component......Page 249
9.3.4 Anti-Redundant Ranking Component......Page 250
9.4.1 Answer Keys......Page 251
9.4.1.2 Nugget-matching rules......Page 252
9.4.2 Evaluating the Utility of a Sequence of Ranked Lists......Page 253
9.4.2.1 Graded passage utility......Page 254
9.5 Data......Page 255
9.6.2 Experimental Setup......Page 256
9.6.3 Results......Page 257
9.8 Acknowledgments......Page 259
References......Page 260
10.1 Entity-Aware Search Architecture......Page 263
10.1.1 Guessing Answer Types......Page 264
10.1.2 Scoring Snippets......Page 265
10.2 Understanding the Question......Page 266
10.2.1 Answer Type Clues in Questions......Page 269
10.2.2 Sequential Labeling of Type Clue Spans......Page 270
10.2.2.1 Parse tree and multiresolution feature table......Page 271
10.2.2.2 Cells and attributes......Page 273
10.2.2.3 Heuristic informer annotation......Page 274
10.2.3 From Type Clue Spans to Answer Types......Page 275
10.2.3.2 Informer hypernym features......Page 276
10.2.4 Experiments......Page 277
10.2.4.1 Informer span tagging accuracy......Page 278
10.2.4.2 Question classification accuracy......Page 280
10.3 Scoring Potential Answer Snippets......Page 281
10.3.1.1 Energy and decay......Page 283
10.3.1.2 Aggregating over many selectors......Page 284
10.3.2 Learning the Proximity Scoring Function......Page 285
10.3.3.1 Data collection and preparation......Page 287
10.3.3.3 Fitting the decay profile......Page 288
10.3.3.4 Accuracy using the fitted decay......Page 289
10.4 Indexing and Query Processing......Page 290
10.4.2 Pre-Generalize and Post-Filter......Page 292
10.4.2.1 Forward index......Page 294
10.4.3 Atype Subset Index Space Model......Page 295
10.4.4 Query Time Bloat Model......Page 296
10.4.5 Choosing an Atype Subset......Page 299
10.4.6.2 Observed space-time trade-off......Page 301
10.5.1 Summary......Page 302
10.5.2 Ongoing and Future Work......Page 303
References......Page 305