Text Mining: Applications and Theory

Text Mining: Applications and Theory presents state-of-the-art algorithms for text mining from both academic and industrial perspectives. The contributors span several countries and scientific domains and work in universities, industrial corporations, and government laboratories. They demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing, and information retrieval to design computational models for automated text analysis and mining.

This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning, and natural language processing can collectively capture, classify, and interpret words and their contexts.  As suggested in the preface, text mining is needed when “words are not enough.”

This book:

  • Provides state-of-the-art algorithms and techniques for critical tasks in text mining applications, such as clustering, classification, anomaly and trend detection, and stream analysis.
  • Presents a survey of text visualization techniques and looks at the multilingual text classification problem.
  • Discusses the issue of cybercrime associated with chatrooms.
  • Features advances in visual analytics and machine learning along with illustrative examples.
  • Is accompanied by a supporting website featuring datasets.

Applied mathematicians, statisticians, practitioners and students in computer science, bioinformatics and engineering will find this book extremely useful.

Author(s): Michael W. Berry, Jacob Kogan
Publisher: Wiley
Year: 2010

Language: English
Pages: 223
Tags: Informatics and Computer Engineering; Artificial Intelligence; Computational Linguistics

Text Mining
Contents
List of Contributors
Preface
PART I TEXT EXTRACTION, CLASSIFICATION, AND CLUSTERING
1.1 Introduction
1.1.1 Keyword extraction methods
1.2 Rapid automatic keyword extraction
1.2.1 Candidate keywords
1.2.2 Keyword scores
1.2.4 Extracted keywords
1.3.1 Evaluating precision and recall
1.3.2 Evaluating efficiency
1.4 Stoplist generation
1.5.2 Extracting keywords from news articles
1.6 Summary
References
2.1 Introduction
2.2 Background
2.3 Experimental setup
2.4 Multilingual LSA
2.5 Tucker1 method
2.6 PARAFAC2 method
2.7 LSA with term alignments
2.8 Latent morpho-semantic analysis (LMSA)
2.10 Discussion of results and techniques
References
3.1 Introduction
3.2.1 Naive Bayes
3.2.2 LogitBoost
3.2.3 Support vector machines
3.2.4 Augmented latent semantic indexing spaces
3.2.5 Radial basis function networks
3.3.1 Feature selection
3.3.2 Message representation
3.4 Evaluation of email classification
3.5.1 Experiments with PU1
3.5.2 Experiments with ZH1
3.6 Characteristics of classifiers
3.7 Concluding remarks
References
4.1 Introduction
4.1.1 Related work
4.2.1 Nonnegative matrix factorization
4.2.2 Algorithms for computing NMF
4.2.3 Datasets
4.2.4 Interpretation
4.3 NMF initialization based on feature ranking
4.3.2 FS initialization
4.4.1 Classification using basis features
4.4.2 Generalizing LSI based on NMF
4.5 Conclusions
References
5.1 Introduction
5.2 Notations and classical k-means
5.3.1 Quadratic k-means with cannot-link constraints
5.3.2 Elimination of must-link constraints
5.3.3 Clustering with Bregman divergences
5.4 Constrained smoka type clustering
5.5 Constrained spherical k-means
5.5.1 Spherical k-means with cannot-link constraints only
5.5.2 Spherical k-means with cannot-link and must-link constraints
5.6 Numerical experiments
5.6.2 Spherical k-means
5.7 Conclusion
References
PART II ANOMALY AND TREND DETECTION
6.1 Visualization in text analysis
6.2 Tag clouds
6.3 Authorship and change tracking
6.5 Sentiment tracking
6.6 Visual analytics and FutureLens
6.7 Scenario discovery
6.7.2 Evaluating solutions
6.8 Earlier prototype
6.9 Features of FutureLens
6.10 Scenario discovery example: bioterrorism
6.11 Scenario discovery example: drug trafficking
6.12 Future work
References
7.1 Introduction
7.2.1 Background
7.2.3 Gaussian-based adaptive threshold setting
7.2.4 Implementation issues
7.3.1 Datasets
7.3.2 Working example
7.3.3 Experiments and results
7.4 Conclusion
References
8.1 Introduction
8.2.1 Capturing IM and IRC chat
8.2.2 Current collections for use in analysis
8.2.4 Internet predation detection
8.2.5 Cyberbullying detection
8.3 Commercial software for monitoring chat
8.4 Conclusions and future directions
References
PART III TEXT STREAMS
9.1 Introduction
9.2 Text streams
9.3 Feature extraction and data reduction
9.4 Event detection
9.5 Trend detection
9.6 Event and trend descriptions
9.7 Discussion
References
10.1 Introduction
10.2.1 Vector space modeling
10.2.3 Probabilistic latent semantic analysis
10.3 Latent Dirichlet allocation
10.3.2 Posterior inference
10.3.3 Online latent Dirichlet allocation (OLDA)
10.3.4 Illustrative example
10.4 Embedding external semantics from Wikipedia
10.5 Data-driven semantic embedding
10.5.1 Generative process with data-driven semantic embedding
10.5.2 OLDA algorithm with data-driven semantic embedding
10.5.3 Experimental design
10.5.4 Experimental results
10.7 Conclusion and future work
References
Index