Unstructured text, as one of the most important data forms, plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to scientific research and healthcare informatics. In many emerging applications, people's information needs from text data are becoming multidimensional: they demand useful insights along multiple aspects of a text corpus. However, acquiring such multidimensional knowledge from massive text data remains a challenging task. This book presents data mining techniques that turn unstructured text data into multidimensional knowledge. We investigate two core questions. (1) How does one identify task-relevant text data with declarative queries in multiple dimensions? (2) How does one distill knowledge from text data in a multidimensional space?

To address these questions, we develop a text cube framework. First, we develop a cube construction module that organizes unstructured data into a cube structure by discovering the latent multidimensional, multi-granular structure of the unstructured text corpus and allocating documents into that structure. Second, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multidimensional knowledge from user-selected data. Together, these two modules constitute an integrated pipeline: leveraging the cube structure, users can perform multidimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can extract multidimensional patterns from the selected data for decision making.

The proposed framework has two distinctive advantages when turning text data into multidimensional knowledge: flexibility and label-efficiency. First, it enables acquiring multidimensional knowledge flexibly, as the cube structure allows users to easily identify task-relevant data along multiple dimensions at varied granularities and then distill multidimensional knowledge from it. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain.
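To make the idea of multidimensional, multi-granular data selection concrete, here is a minimal toy sketch in Python. It is not the book's implementation: the `TextCube` class, the `PARENT` hierarchies, and the sample documents are all hypothetical, and the dimension labels are assigned by hand, whereas the book's cube construction module discovers the structure and allocates documents with little supervision.

```python
from collections import defaultdict

# Hypothetical dimension hierarchies (assumption, for illustration only):
# each fine-grained label maps to its coarser parent label.
PARENT = {
    "topic": {"basketball": "sports", "soccer": "sports", "elections": "politics"},
    "location": {"Chicago": "USA", "New York": "USA", "London": "UK"},
}


def matches(dim, label, wanted):
    """Return True if `label` equals `wanted` or rolls up to it in `dim`'s hierarchy."""
    while label is not None:
        if label == wanted:
            return True
        label = PARENT[dim].get(label)  # climb one level; None at the root
    return False


class TextCube:
    """Toy cube: each cell is keyed by one label per dimension and holds documents."""

    def __init__(self, dims):
        self.dims = tuple(dims)
        self.cells = defaultdict(list)

    def add(self, doc, **labels):
        # Labels are supplied by hand here; this stands in for document allocation.
        key = tuple(labels[d] for d in self.dims)
        self.cells[key].append(doc)

    def select(self, **query):
        """Return documents whose cell matches every queried dimension, at any granularity."""
        selected = []
        for key, docs in self.cells.items():
            if all(
                matches(d, key[i], query[d])
                for i, d in enumerate(self.dims)
                if d in query
            ):
                selected.extend(docs)
        return selected


cube = TextCube(["topic", "location"])
cube.add("Bulls win at home", topic="basketball", location="Chicago")
cube.add("Derby ends in a draw", topic="soccer", location="London")
cube.add("Mayoral race tightens", topic="elections", location="New York")

sports_docs = cube.select(topic="sports")                      # coarse query on one dimension
us_sports_docs = cube.select(topic="sports", location="USA")   # two dimensions combined
```

A query names a label per dimension at whatever granularity the user wants; a dimension left out of the query is unconstrained. The selected documents are what a cube exploitation algorithm would then analyze.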
Author(s): Chao Zhang, Jiawei Han
Series: Synthesis Lectures on Data Mining and Knowledge Discovery
Publisher: Morgan & Claypool
Year: 2019
Language: English
Pages: 198
Tags: Data Mining, Knowledge Discovery, Multidimensional Mining, Massive Text Data
Overview
Part I: Cube Construction
Example Applications
Technical Roadmap
Task 1: Taxonomy Generation
Task 3: Multidimensional Summarization
Task 5: Abnormal Event Detection
Organization
Cube Construction Algorithms
Overview
Pattern-Based Extraction
Clustering-Based Taxonomy Construction
Adaptive Term Clustering
Spherical Clustering for Topic Splitting
Identifying Representative Terms
Learning Local Term Embeddings
Experimental Setup
Qualitative Results
Quantitative Analysis
Summary
Overview
Related Work
Framework Overview
Hierarchical Tree Expansion
Taxonomy Global Optimization
Experimental Setup
Qualitative Results
Quantitative Results
Summary
Overview
Latent Variable Models
Preliminaries
Modeling Class Distribution
Generating Pseudo-Documents
Neural Models with Self-Training
Neural Model Self-Training
Instantiating with CNNs and RNNs
Experiments
Baselines
Experiment Settings
Experiment Results
Parameter Study
Case Study
Summary
Overview
Hierarchical Text Classification
Pseudo-Document Generation
Global Classifier Self-Training
Algorithm Summary
Experiment Settings
Component-Wise Evaluation
Summary
Cube Exploitation Algorithms
Introduction
Preliminaries
Text Cube Preliminaries
Problem Definition
Popularity and Integrity
Neighborhood-Aware Distinctiveness
Overview
Hybrid Offline Materialization
Optimized Online Processing
Experimental Setup
Effectiveness Evaluation
Efficiency Evaluation
Summary
Overview
Related Work
Method Overview
The Unsupervised Reconstruction Task
The Optimization Procedure
Life-Decaying Learning
Constraint-Based Learning
Experiments
Experimental Setup
Quantitative Comparison
Case Studies
Effects of Parameters
Downstream Application
Summary
Overview
Bursty Event Detection
Preliminaries
Method Overview
Multimodal Embedding
Candidate Generation
A Bayesian Mixture Clustering Model
Parameter Estimation
Features Induced from Multimodal Embeddings
Complexity Analysis
Experimental Settings
Qualitative Results
Quantitative Results
Scalability Study
Summary
Summary
Future Work
Bibliography
Authors' Biographies