Latent semantic mapping (LSM) is a generalization of latent semantic analysis (LSA), a paradigm originally developed to capture hidden word patterns in a text document corpus. In information retrieval, LSA enables retrieval on the basis of conceptual content, instead of merely matching words between queries and documents. It operates under the assumption that there is some latent semantic structure in the data, which is partially obscured by the randomness of word choice with respect to retrieval. Algebraic and/or statistical techniques are brought to bear to estimate this structure and get rid of the obscuring noise. This results in a parsimonious continuous parameter description of words and documents, which then replaces the original parameterization in indexing and retrieval. This approach exhibits three main characteristics: Discrete entities (words and documents) are mapped onto a continuous vector space; this mapping is determined by global correlation patterns; and dimensionality reduction is an integral part of the process. Such fairly generic properties are advantageous in a variety of different contexts, which motivates a broader interpretation of the underlying paradigm. The outcome (LSM) is a data-driven framework for modeling meaningful global relationships implicit in large volumes of (not necessarily textual) data. This monograph gives a general overview of the framework, and underscores the multifaceted benefits it can bring to a number of problems in natural language understanding and spoken language processing. It concludes with a discussion of the inherent tradeoffs associated with the approach, and some perspectives on its general applicability to data-driven information extraction.
Author(s): Jerome R. Bellegarda
Year: 2008
Language: English
Pages: 112
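The abstract above outlines the core mechanics treated in Part I of the book: a co-occurrence matrix over discrete entities, a singular value decomposition, and retrieval or comparison in the resulting reduced-dimensional vector space. As a rough illustration only, not drawn from the book, the following minimal Python sketch shows that pipeline for the classic LSA case: a word-document count matrix is factored with a truncated SVD, and documents are then compared by cosine similarity in the reduced space. The toy corpus, the choice of dimension k = 2, and the omission of any entropy or tf-idf weighting are all illustrative simplifications.

```python
# Minimal LSA-style sketch (illustrative, not the book's implementation):
# build a word-document count matrix, reduce it with a truncated SVD,
# and compare documents by cosine similarity in the reduced space.
import numpy as np

docs = [
    "human machine interface for computer applications",
    "system and human system engineering testing of the computer",
    "graph of trees and minors",
    "intersection graph of paths in trees",
]

# Word-document count matrix W (rows: words, columns: documents).
vocab = sorted({w for d in docs for w in d.split()})
W = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: W ~ U_k S_k V_k^T. The scaled rows of V_k give the
# continuous, low-dimensional document representations.
k = 2
U, s, Vt = np.linalg.svd(W, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Within-topic pairs should come out more similar than the cross-topic pair.
print(cosine(doc_vecs[0], doc_vecs[1]))  # "human/computer" documents
print(cosine(doc_vecs[2], doc_vecs[3]))  # "graph/trees" documents
print(cosine(doc_vecs[0], doc_vecs[2]))  # across topics
```

In the LSM generalization, the same machinery applies to arbitrary units and compositions rather than words and documents, which is what the Unit-Unit and Unit-Composition comparisons listed in Part I refer to.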
Contents:
I Principles
Introduction
Motivation
From LSA to LSM
Part I
Part II
Part III
Co-occurrence Matrix
Singular Value Decomposition
Interpretation
LSM Feature Space
Unit-Unit Comparisons
Unit-Composition Comparisons
LSM Framework Extension
Salient Characteristics
Computational Effort
Off-Line Cost
Possible Shortcuts
Other Matrix Decompositions
Alternative Formulations
Composition Model
Unit Model
Probabilistic Latent Semantic Analysis
Inherent Limitations
II Applications
Junk E-Mail Filtering
Header Analysis
Machine Learning Approaches
LSM-Based Filtering
Performance
Semantic Classification
Case Study: Desktop Interface Control
Language Modeling Constraints
Illustration
Caveats
Language Modeling
N-Gram Limitations
Hybrid Formulation
Context Scope Selection
Smoothing
Document Smoothing
Joint Smoothing
Top-Down Approaches
Illustration
Bottom-Up Approaches
Orthographic Neighborhoods
Sequence Alignment
Speaker Verification
The Task
Single-Utterance Representation
LSM-Tailored Metric
Integration with DTW
TTS Unit Selection
Concatenative Synthesis
Feature Extraction
Comparison to Fourier Analysis
Properties
LSM-Based Boundary Training
III Perspectives
Discussion
Descriptive Power
Domain Sensitivity
Natural Language Processing
Generic Pattern Recognition
Conclusion
Summary
Perspectives