The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. This approach has several advantages: (i) Working with such corpora obviates the problems encountered when using Internet search engines in quantitative linguistic research (such as non-transparent ranking algorithms). (ii) Creating a corpus from web data is virtually free. (iii) The size of corpora compiled from the WWW may exceed the size of language resources offered elsewhere by several orders of magnitude. (iv) The data is locally available to users, who can linguistically post-process and query it with the tools of their choice. This book addresses the main practical tasks in the creation of web corpora up to giga-token size. Among these tasks are the sampling process (i.e., web crawling) and the usual cleanups, including boilerplate removal and removal of duplicated content. Linguistic processing, and the problems that the various kinds of noise in web corpora pose for it, are also covered. Finally, the authors show how web corpora can be evaluated and compared to other corpora (such as traditionally compiled corpora).
Author(s): Schäfer, Roland; Bildhauer, Felix
Series: Synthesis Lectures on Human Language Technologies
Publisher: Morgan & Claypool Publishers
Year: 2013
Language: English
ISBN (print): 9781608459834
Category: Library Science; Computer Science/IT
Pages: 147
Tags: Internet searching; Linguistics; Web search engines
Preface
Acknowledgments
Web Corpora
Data Collection
Introduction
The Structure of the Web
General Properties
Accessibility and Stability of Web Pages
What's in a (National) Top Level Domain?
Problematic Segments of the Web
Crawling Basics
Introduction
Corpus Construction From Search Engine Results
Crawlers and Crawler Performance
Configuration Details and Politeness
Seed URL Generation
More on Crawling Strategies
Introduction
Biases and the PageRank
Focused Crawling
Post-Processing
Introduction
Basic Cleanups
HTML Stripping
Character References and Entities
Character Sets and Conversion
Further Normalization
Boilerplate Removal
Introduction to Boilerplate
Feature Extraction
Choice of the Machine Learning Method
Language Identification
Duplicate Detection
Types of Duplication
Perfect Duplicates and Hashing
Near Duplicates, Jaccard Coefficients, and Shingling
Linguistic Processing
Introduction
Basics of Tokenization, Part-Of-Speech Tagging, and Lemmatization
Tokenization
Part-Of-Speech Tagging
Lemmatization
Linguistic Post-Processing of Noisy Data
Introduction
Treatment of Noisy Data
Tokenizing Web Texts
Example: Missing Whitespace
Example: Emoticons
POS Tagging and Lemmatization of Web Texts
Tracing Back Errors in POS Tagging
Orthographic Normalization
Software for Linguistic Post-Processing
Corpus Evaluation and Comparison
Introduction
Rough Quality Check
Word and Sentence Lengths
Duplication
Measuring Corpus Similarity
Inspecting Frequency Lists
Hypothesis Testing with χ²
Hypothesis Testing with Spearman's Rank Correlation
Using Test Statistics without Hypothesis Testing
Comparing Keywords
Keyword Extraction with χ²
Keyword Extraction Using the Ratio of Relative Frequencies
Variants and Refinements
Extrinsic Evaluation
Corpus Composition
Estimating Corpus Composition
Measuring Corpus Composition
Interpreting Corpus Composition
Summary
Bibliography
Authors' Biographies