Text mining tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. In addition to providing an in-depth examination of core text mining and link detection algorithms and operations, this book examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches. Finally, it explores current real-world, mission-critical applications of text mining and link detection in such varied fields as M&A business intelligence, genomics research, and counter-terrorism activities.
Author(s): Ronen Feldman, James Sanger
Year: 2006
Language: English
Pages: 422
Half-title......Page 3
Title......Page 5
Copyright......Page 6
Dedication......Page 7
Contents......Page 9
Preface......Page 12
ACKNOWLEDGMENTS......Page 13
I.1 DEFINING TEXT MINING......Page 15
I.1.1 The Document Collection and the Document......Page 16
“Weakly Structured” and “Semistructured” Documents......Page 17
I.1.2 Document Features......Page 18
Commonly Used Document Features: Characters, Words, Terms, and Concepts......Page 19
I.1.3 The Search for Patterns and Trends......Page 22
I.1.4 The Importance of the Presentation Layer......Page 24
Sections I.1–I.1.1......Page 25
Section I.1.3......Page 26
I.2.1 Functional Architecture......Page 27
Section I.2......Page 31
Section I.2.1......Page 32
II.1.1 Distributions......Page 33
Frequent Concept Sets......Page 37
Discovering Frequent Concept Sets......Page 38
II.1.3 Associations......Page 39
Discovering Association Rules......Page 40
Maximal Association Rules: Defining M-Support and M-Confidence......Page 41
M-Factor......Page 42
Interestingness with Respect to Distributions and Proportions......Page 43
Trend Analysis......Page 44
Ephemeral Associations......Page 45
From Context Relationships to Trend Graphs......Page 46
The Context Graph......Page 47
The Trend Graph......Page 49
The Borders Incremental Text Mining Algorithm......Page 50
Section II.1.2......Page 53
Section II.1.5......Page 54
II.2.1 Domains and Background Knowledge......Page 55
II.2.2 Domain Ontologies......Page 56
II.2.3 Domain Lexicons......Page 57
II.2.4 Introducing Background Knowledge into Text Mining Systems......Page 58
System Architecture......Page 60
Implementation......Page 61
Experimental Performance Results......Page 64
II.3 TEXT MINING QUERY LANGUAGES......Page 65
II.3.2 KDTL Query Examples......Page 66
Sections II.3–II.3.2......Page 69
III Text Mining Preprocessing Techniques......Page 71
III.1 TASK-ORIENTED APPROACHES......Page 72
III.1.1 General Purpose NLP Tasks......Page 73
Syntactical Parsing......Page 74
III.1.2 Problem-Dependent Tasks: Text Categorization and Information Extraction......Page 75
Constituency Grammars......Page 76
General Information Extraction......Page 77
IV Categorization......Page 78
IV.1.2 Document Sorting and Text Filtering......Page 79
IV.2 DEFINITION OF THE PROBLEM......Page 80
IV.2.3 Hard versus Soft Categorization......Page 81
IV.3.1 Feature Selection......Page 82
IV.3.2 Dimensionality Reduction by Feature Extraction......Page 83
IV.5 MACHINE LEARNING APPROACH TO TC......Page 84
IV.5.2 Bayesian Logistic Regression......Page 85
IV.5.3 Decision Tree Classifiers......Page 86
IV.5.4 Decision Rule Classifiers......Page 87
IV.5.6 The Rocchio Methods......Page 88
IV.5.8 Example-Based Classifiers......Page 89
IV.5.9 Support Vector Machines......Page 90
IV.5.10 Classifier Committees: Bagging and Boosting......Page 91
IV.6 USING UNLABELED DATA TO IMPROVE CLASSIFICATION......Page 92
IV.7.2 Benchmark Collections......Page 93
Section IV.3......Page 94
Section IV.7......Page 95
V.1.1 Improving Search Recall......Page 96
V.1.4 Query-Specific Clustering......Page 97
V.2.1 Problem Representation......Page 98
V.3 CLUSTERING ALGORITHMS......Page 99
V.3.1 K-Means Algorithm......Page 100
V.3.3 Hierarchical Agglomerative Clustering (HAC)......Page 101
V.4 CLUSTERING OF TEXTUAL DATA......Page 102
V.4.3 Singular Value Decomposition......Page 103
Using Naïve Bayes Mixture Models with the EM Clustering Algorithm......Page 104
V.4.5 Evaluation of Text Clustering......Page 105
Section V.4......Page 106
VI.1 INTRODUCTION TO INFORMATION EXTRACTION......Page 108
VI.2.1 Named Entity Recognition......Page 110
VI.2.2 Template Element Task......Page 112
VI.2.5 Coreference Task (CO)......Page 113
VI.2.6 Some Notes about IE Evaluation......Page 114
VI.3.2 Case 2: Natural Disasters Domain......Page 115
VI.3.4 Technology-Related Article, TIPSTER-Style Tagging......Page 116
VI.4 ARCHITECTURE OF IE SYSTEMS......Page 118
VI.4.1 Information Flow in an IE System......Page 119
Proper Name Identification......Page 120
Shallow Parsing......Page 121
Inferencing......Page 122
VI.5 ANAPHORA RESOLUTION......Page 123
VI.5.4 Predicate Nominative......Page 124
VI.5.8 One-Anaphora......Page 125
VI.5.10.1 Hobbs Algorithm......Page 126
VI.5.11.1 Kennedy and Boguraev......Page 127
VI.5.11.2 Mitkov......Page 128
VI.5.11.3 Evaluation of Knowledge-Poor Approaches......Page 130
VI.5.11.4 Machine Learning Approaches......Page 131
VI.6.2 BWI......Page 133
VI.6.3 The (LP)² Algorithm......Page 134
VI.6.4 Experimental Evaluation......Page 135
VI.7.2 Overall Problem Definition......Page 136
VI.7.4 Problem Formulation for the Perceptual Grouping Subtask......Page 137
VI.7.5 Algorithm for Constructing a Document O-Tree......Page 138
VI.7.6.1 Basic Algorithm......Page 139
VI.7.7 Templates......Page 141
VI.7.8 Experimental Results......Page 142
Section VI.4......Page 143
Section VI.6......Page 144
VII.1 HIDDEN MARKOV MODELS......Page 145
VII.1.2 The Forward–Backward Procedure......Page 146
VII.1.3 The Viterbi Algorithm......Page 147
VII.1.4 The Training of the HMM......Page 149
VII.1.5 Dealing with Training Data Sparseness......Page 150
VII.2.1 Using SCFGs......Page 151
VII.3 MAXIMAL ENTROPY MODELING......Page 152
VII.4 MAXIMAL ENTROPY MARKOV MODELS......Page 154
VII.4.1 Training the MEMM......Page 155
VII.5 CONDITIONAL RANDOM FIELDS......Page 156
VII.5.2 Computing the Conditional Probability......Page 157
VII.5.4 Training the CRF......Page 158
Section VII.5......Page 159
VIII.1.1 Using HMM to Extract Fields from Whole Documents......Page 160
VIII.1.2 Learning HMM Structure from Data......Page 162
VIII.1.3 Nymble: An HMM with Context-Dependent Probabilities......Page 163
VIII.2 USING MEMM FOR INFORMATION EXTRACTION......Page 166
VIII.3.1 POS-Tagging with Conditional Random Fields......Page 167
VIII.3.2 Shallow Parsing with Conditional Random Fields......Page 168
VIII.4.1 Introduction to a Hybrid System......Page 169
VIII.4.3 Syntax of a TEG Rulebook......Page 170
VIII.4.4 TEG Training......Page 172
VIII.4.5 Additional features......Page 175
VIII.4.6 Example of Real Rules......Page 176
ACE-2 Evaluation: Extracting Relationships......Page 178
VIII.5.1 Introduction to Bootstrapping: The AutoSlog-TS Approach......Page 180
VIII.5.2 Mutual Bootstrapping......Page 182
VIII.5.3 Metabootstrapping......Page 183
Evaluation of the Metabootstrapping Algorithm......Page 184
VIII.5.4 Using Strong Syntactic Heuristics......Page 185
VIII.5.4.2 Using Cotraining......Page 186
VIII.5.5 The Basilisk Algorithm......Page 187
VIII.5.6 Bootstrapping by Using Term Categorization......Page 188
Section VIII.3......Page 189
Section VIII.5......Page 190
IX.1 BROWSING......Page 191
IX.1.1 Displaying and Browsing Distributions......Page 193
IX.1.2 Displaying and Exploring Associations......Page 194
IX.1.3 Navigation and Exploration by Means of Concept Hierarchies......Page 196
IX.1.4 Concept Hierarchy and Taxonomy Editors......Page 197
IX.1.5 Clustering Tools to Aid Data Exploration......Page 198
IX.2 ACCESSING CONSTRAINTS AND SIMPLE SPECIFICATION FILTERS AT THE PRESENTATION LAYER......Page 199
IX.3 ACCESSING THE UNDERLYING QUERY LANGUAGE......Page 200
Section IX.1......Page 201
Section IX.3......Page 202
X.1 INTRODUCTION......Page 203
X.1.1 Citations and Notes......Page 205
X.2 ARCHITECTURAL CONSIDERATIONS......Page 206
X.3.1 Overview......Page 208
Simple Concept Set Graphs......Page 209
Simple Concept Association Graphs......Page 212
Similarity Functions for Simple Concept Association Graphs......Page 214
Equivalence Classes, Partial Orderings, Redundancy Filters......Page 215
Typical Interactive Operations Using Simple Concept Graphs......Page 216
Drawbacks of Simple Concept Graphs......Page 218
X.3.3 Histograms......Page 219
X.3.4 Line Graphs......Page 221
X.3.5 Circle Graphs......Page 222
Category-Connecting Maps......Page 225
Multiple Circle Graph and Combination Graph Approaches......Page 226
WEBSOM......Page 227
SOM Algorithm......Page 230
X.3.7 Hyperbolic Trees......Page 231
X.3.8 Three-Dimensional (3-D) Effects......Page 233
X.3.9 Hybrid Tools......Page 235
Sections X.3.4–X.3.7......Page 238
X.4 VISUALIZATION TECHNIQUES IN LINK ANALYSIS......Page 239
X.4.1 Practical Approaches Using Generic Visualization Tools......Page 240
X.4.2 “Fisheye” Diagrams......Page 241
Distorting Fisheye Views......Page 242
Filtering Fisheye Views......Page 243
Applications to Link Detection and General Effectiveness of Fisheye Approaches......Page 244
X.4.3 Spring-Embedded Network Graphs......Page 245
X.4.4 Critical Path and Pathway Analysis Graphs......Page 248
X.5 REAL-WORLD EXAMPLE: THE DOCUMENT EXPLORER SYSTEM......Page 249
Visual Administrative Tools: Term Hierarchy Editor......Page 251
Visualization Tools......Page 252
X.5.2 Citations and Notes......Page 254
XI.1 PRELIMINARIES......Page 256
XI.2 AUTOMATIC LAYOUT OF NETWORKS......Page 258
XI.2.1 Force-Directed Graph Layout Algorithms......Page 259
Fruchterman–Reingold (FR) Method......Page 260
XI.3 PATHS AND CYCLES IN GRAPHS......Page 262
XI.4.1 Degree Centrality......Page 263
XI.4.2 Closeness Centrality......Page 265
XI.4.3 Betweenness Centrality......Page 266
XI.4.4 Eigenvector Centrality......Page 267
XI.4.5 Power Centrality......Page 268
XI.4.6 Network Centralization......Page 269
XI.4.7 Summary Diagram......Page 270
XI.5 PARTITIONING OF NETWORKS......Page 271
Algorithm for finding the main core......Page 272
XI.5.3 Equivalence between Entities......Page 274
Regular Equivalence......Page 275
XI.5.4 Block Modeling......Page 276
Formal Notations......Page 278
Finding the Best Block Model......Page 279
Block Modeling of the Hijacker Network......Page 280
XI.6 PATTERN MATCHING IN NETWORKS......Page 284
XI.7.2 UCINET......Page 285
Section XI.6......Page 286
XII Text Mining Applications......Page 287
XII.1.2 Generalized Background Knowledge versus Specialized Background Knowledge......Page 288
XII.1.3 Leveraging Preset Queries and Constraints in Generalized Browsing Interfaces......Page 290
XII.2 CORPORATE FINANCE: MINING INDUSTRY LITERATURE FOR BUSINESS INTELLIGENCE......Page 293
Data and Background Knowledge Sources......Page 295
Preprocessing Operations......Page 296
Core Mining Operations and Refinement Constraints......Page 298
Presentation Layer – GUI and Visualization Tools......Page 299
Examining the Biotech Industry Trade Press for Information on Merger Activity......Page 302
Exploring Corporate Earnings Announcements......Page 303
Exploring Available Information about Drugs Still in Clinical Trials......Page 306
XII.2.3 Citations and Notes......Page 308
XII.3 A “HORIZONTAL” TEXT MINING APPLICATION: PATENT ANALYSIS SOLUTION LEVERAGING A COMMERCIAL TEXT ANALYTICS PLATFORM......Page 309
XII.3.1 Patent Researcher: Basic Architecture and Functionality......Page 310
Preprocessing Operations......Page 311
Core Mining Operations and Refinement Constraints......Page 312
Presentation Layer – GUI and Visualization Tools......Page 313
XII.3.2 Application Usage Scenarios......Page 314
Looking at the Frequency Distributions among Patents in the Document Collection......Page 315
Exploring Trends in Issued Patents......Page 317
XII.4 LIFE SCIENCES RESEARCH: MINING BIOLOGICAL PATHWAY INFORMATION WITH GENEWAYS......Page 321
Preprocessing Operations......Page 322
XII.4.2 Implementation and Typical Usage......Page 324
XII.4.3 Citations and Notes......Page 327
A.1 WHAT IS THE DIAL LANGUAGE?......Page 329
A.2 INFORMATION EXTRACTION IN THE DIAL ENVIRONMENT......Page 330
A.4 CONCEPT AND RULE STRUCTURE......Page 332
A.4.1 Context......Page 333
A.5 PATTERN MATCHING......Page 334
A.6 PATTERN ELEMENTS......Page 335
A.6.2 Wordclass Names......Page 336
A.6.3 Thesaurus Names......Page 337
A.6.5 Character-Level Regular Expressions......Page 338
A.7 RULE CONSTRAINTS......Page 339
A.8 CONCEPT GUARDS......Page 340
A.9.1 Extracting People Names Based on Title/Position......Page 341
A.9.3 Using a Thesaurus to Extract Location Names......Page 343
A.9.5 A Simplified Anaphora Resolution Rule for Resolving a Person’s Pronoun......Page 344
A.9.6 Anaphoric Family Relation......Page 345
A.9.7 Meeting between People......Page 346
Bibliography......Page 349
Index......Page 403