As online information grows dramatically, search engines such as Google are playing an increasingly important role in our lives. Critical to all search engines is the problem of designing an effective retrieval model that can rank documents accurately for a given query. This has been a central research problem in information retrieval for several decades. In the past ten years, a new generation of retrieval models, often referred to as statistical language models, has been successfully applied to many different information retrieval problems. Compared with traditional models such as the vector space model, these new models have a sounder statistical foundation and can leverage statistical estimation to optimize retrieval parameters. They can also be more easily adapted to model non-traditional and complex retrieval problems. Empirically, they tend to achieve comparable or better performance than a traditional model with less effort on parameter tuning. This book systematically reviews the large body of literature on applying statistical language models to information retrieval, with an emphasis on the underlying principles, empirically effective language models, and language models developed for non-traditional retrieval tasks. All the relevant literature has been synthesized to make it easy for a reader to digest the research progress achieved so far and see the frontier of research in this area. The book also offers practitioners an informative introduction to a set of practically useful language models that can effectively solve a variety of retrieval problems. No prior knowledge of information retrieval is required, but some basic knowledge of probability and statistics would be useful for fully digesting all the details.
Table of Contents: Introduction / Overview of Information Retrieval Models / Simple Query Likelihood Retrieval Model / Complex Query Likelihood Model / Probabilistic Distance Retrieval Model / Language Models for Special Retrieval Tasks / Language Models for Latent Topic Analysis / Conclusions
Author(s): ChengXiang Zhai
Series: Synthesis Lectures on Human Language Technologies
Publisher: Morgan and Claypool Publishers
Year: 2008
Language: English
Pages: 142
Tags: Computer Science and Engineering; Artificial Intelligence; Computational Linguistics
Synthesis Lectures on Human Language Technologies
Contents
Preface
Introduction
Basic Concepts in Information Retrieval
Statistical Language Models
Similarity-Based Models
Probabilistic Relevance Models
Probabilistic Inference Models
Axiomatic Retrieval Framework
Decision-Theoretic Retrieval Framework
Summary
Basic Idea
Multinomial θ_D
Multiple Poisson θ_D
Estimation of θ_D
A General Smoothing Strategy using Collection Language Model
Jelinek-Mercer Smoothing (Fixed Coefficient Interpolation)
Dirichlet Prior Smoothing
Interpolation vs. Backoff
Comparison of Different Smoothing Methods
Smoothing and TF-IDF Weighting
Two-Stage Smoothing
Exploit Document Prior
Summary
Cluster-Based Smoothing
Document Expansion
Beyond Unigram Models
Parsimonious Language Models
Full Bayesian Query Likelihood
Translation Model
Summary
Difficulty in Supporting Feedback with Query Likelihood
Kullback-Leibler Divergence Retrieval Model
Model-Based Feedback
Markov Chain Query Model Estimation
Relevance Model
Structured Query Models
Negative Relevance Feedback
Summary
Cross-Lingual Information Retrieval
Distributed Information Retrieval
Structured Document Retrieval and Combining Representations
Personalized and Context-Sensitive Search
Expert Finding
Passage Retrieval
Subtopic Retrieval
Modeling Redundancy and Novelty
Predicting Query Difficulty
Summary
Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA)
Extensions of PLSA and LDA
Topic Model Labeling
Using Topic Models for Retrieval
Summary
Language Models vs. Traditional Retrieval Models
Summary of Research Progress
Bibliography