This handbook is a comprehensive, practical resource on corpus linguistics. It covers a range of basic and advanced approaches, methods and techniques, from the principles of corpus compilation to quantitative data analysis. The Handbook is organized in six Parts. Parts I to III discuss key issues and practical know-how relating to corpus design, corpus methods and corpus types. Parts IV and V offer a user-friendly introduction to the quantitative analysis of corpus data: for each statistical technique discussed, the chapters provide a practical guide with R and come with supplementary online material. Part VI focuses on how to write a corpus linguistic paper and how to meta-analyze corpus linguistic research. The volume can serve as a course book as well as for individual study. It will be essential reading for students of corpus linguistics as well as for experienced researchers who want to expand their knowledge of the field.
Author(s): Magali Paquot; Stefan Th. Gries
Publisher: Springer
Year: 2020
Language: English
Pages: 686
Introduction
References
Contents
Part I Corpus Design
1 Corpus Compilation
1.1 Introduction
1.2 Fundamentals
1.2.1 Representativeness
1.2.2 Issues in Collecting Data for the Corpus
1.2.3 Ethical Considerations
1.2.4 Documenting What Is in the Corpus
1.2.5 Formatting and Enriching the Corpus
1.2.6 Sharing the Corpus
1.2.7 Corpus Comparison
1.3 Critical Assessment and Future Directions
Further Reading
References
2 Corpus Annotation
2.1 Introduction
2.2 Fundamentals
2.2.1 Part-of-Speech Tagging
2.2.2 Lemmatization
2.2.3 Syntactic Parsing
2.2.4 Semantic Annotation
2.2.5 Annotation Accuracy
2.2.6 Practicalities of Annotation
2.3 Critical Assessment and Future Directions
2.4 Tools and Resources
Further Reading
References
3 Corpus Architecture
3.1 Introduction
3.2 Fundamentals
3.2.1 Corpus Macro-structure
3.2.2 Primary Data and Text Representation
3.2.3 Data Models for Document Annotations
3.3 Critical Assessment and Future Directions
3.4 Tools and Resources
Further Reading
References
Part II Corpus methods
4 Analysing Frequency Lists
4.1 Introduction
4.2 Fundamentals
4.2.1 Zipf's Law
4.2.2 Unit of Analysis
4.2.3 Beyond Raw Frequency
4.2.3.1 Normalising Frequency Counts
4.2.3.2 Range and Dispersion
4.3 Critical Assessment and Future Directions
4.3.1 Dealing with Homoforms and Multi-word Units
4.3.2 Application of Dispersion (and other) Statistics
4.3.3 Addressing Reliability in the Validation of Frequency Lists
4.4 Tools and Resources
Further Reading
References
5 Analyzing Dispersion
5.1 Introduction
5.2 Fundamentals
5.2.1 An Overview of Measures of Dispersion
5.2.2 Areas of Application and Validation
5.3 Critical Assessment and Future Directions
5.4 Tools and Resources
Further Reading
References
6 Analysing Keyword Lists
6.1 Introduction
6.2 Fundamentals
6.3 Critical Assessment and Future Directions
6.3.1 Corpus Preparation
6.3.2 Focus on Differences
6.3.3 Applications of Statistics
6.3.4 Clusters and N-Grams
6.3.5 Future Directions
6.4 Tools and Resources
6.4.1 Tools
6.4.2 Resources (Word Lists)
Further Reading
References
7 Analyzing Co-occurrence Data
7.1 Introduction
7.1.1 General Introduction
7.2 Fundamentals
7.3 Critical Assessment and Future Directions
7.3.1 Unifying the Most Widely-Used AMs
7.3.2 Additional (Different) Ways to Quantify Basic Co-occurrence
7.3.3 Additional Information to Include
7.4 Tools and Resources
Further Reading
References
8 Analyzing Concordances
8.1 Introduction
8.2 Fundamentals
8.2.1 Sorting and Pruning Concordances
8.2.2 Qualitative Analysis of Concordance Lines
8.2.3 Quantitative Analysis of Concordance Lines
8.2.4 Pedagogical Applications of Concordance Lines
8.3 Critical Assessment and Future Directions
8.4 Tools and Resources
Further Reading
References
9 Programming for Corpus Linguistics
9.1 Introduction
9.2 Fundamentals
9.2.1 The Basic Building Blocks of Software Programs
9.2.2 Choosing a Suitable Language for Programming in Corpus Linguistics
9.3 First Steps in Programming
9.3.1 Case Study 1: Simple Scripts to Load, Clean, and Process Large Batches of Text Data
9.3.1.1 Loading a Corpus File and Showing its Contents
9.3.1.2 Loading a Corpus File, Cleaning it, and Showing Its Contents
9.3.1.3 Loading a Web Page, Cleaning it, and Showing Its Contents
9.3.1.4 Loading an Entire Corpus and Showing its Contents
9.3.2 Case Study 2: Scripting the Core Functions of Corpus Analysis Toolkits
9.3.2.1 Creating a Word-type Frequency List for an Entire Corpus
9.3.2.2 Creating a Key-Word-In-Context (KWIC) Concordancer
9.3.2.3 Creating a “MyConc” Object-Oriented Corpus Analysis Toolkit
9.4 Critical Assessment and Future Directions
9.5 Tools and Resources
Further Reading
References
Part III Corpus types
10 Diachronic Corpora
10.1 Introduction
10.2 Fundamentals
10.2.1 Issues and Challenges of Diachronic Corpus Compilation
10.2.1.1 Identifying the Lectal and Diatypic Properties of Texts
10.2.1.2 Redressing Historical Bias
10.2.1.3 Diachronic Comparability
10.2.2 Issues and Challenges of Text-Internal Annotation
10.2.3 Issues and Challenges Specific to the Analysis of Diachronic Corpora
10.3 Critical Assessment and Future Directions
10.4 Tools and Resources
Further Reading
References
11 Spoken Corpora
11.1 Introduction
11.2 Fundamentals
11.2.1 Raw Data and Different Types of Spoken Corpora
11.2.2 Corpus Annotation
11.2.2.1 Orthographic Transcription
11.2.2.2 POS-Tagging and Lemmatisation
11.2.2.3 Parsing
11.2.2.4 Phonemic and Phonetic Transcription
11.2.2.5 Prosodic Transcription
11.2.2.6 Multi-layered and Time-Aligned Annotation
11.2.3 Data Format and Metadata
11.2.4 Corpus Search
11.3 Critical Assessment and Future Directions
11.4 Tools and Resources
Further Reading
References
12 Parallel Corpora
12.1 Introduction
12.2 Fundamentals
12.2.1 Types of Parallel Corpora
12.2.2 Main Characteristics of Parallel Corpora
12.2.3 Methods of Analysis in Cross-Linguistic Research
12.2.4 Issues and Methodological Challenges
12.2.4.1 Issues and Challenges Specific to the Design of Parallel Corpora
12.2.4.2 Issues and Challenges Specific to the Analysis of Parallel Corpora
12.3 Critical Assessment and Future Directions
12.4 Tools and Resources
12.4.1 Query Tools
12.4.2 Resources
12.4.3 Surveys of Available Parallel Corpora
Further Reading
References
13 Learner Corpora
13.1 Introduction
13.2 Fundamentals
13.2.1 Types of Learner Corpora
13.2.2 Metadata
13.2.3 Annotation
13.2.4 Methods of Analysis
13.3 Critical Assessment and Future Directions
13.4 Tools and Resources
Further Reading
References
14 Child-Language Corpora
14.1 Introduction
14.2 Fundamentals
14.2.1 Recording and Contextual Setting
14.2.2 Subject Sampling
14.2.3 Size of Corpora and Recording Intervals
14.2.4 Transcription
14.2.5 Metadata
14.2.6 Further Annotations
14.2.7 Ethical Considerations
14.3 Critical Assessment and Future Directions
14.4 Tools and Resources
Further Reading
References
15 Web Corpora
15.1 Introduction
15.2 Fundamentals
15.2.1 Web as Corpus
15.2.2 Web for Corpus
15.3 Critical Assessment and Future Directions
15.4 Tools and Resources
15.4.1 Web Corpora
15.4.2 Crawling and Text Processing
Further Reading
References
16 Multimodal Corpora
16.1 Introduction
16.2 Fundamentals
16.2.1 Defining Multimodality and Multimodal Corpora
16.2.2 Multimodality Research in Linguistics
16.2.3 Issues and Methodological Challenges
16.3 Critical Assessment and Future Directions
16.4 Tools and Resources
Further Reading
References
Part IV Exploring Your Data
17 Descriptive Statistics and Visualization with R
17.1 Introduction
17.2 An Introduction to R and RStudio
17.2.1 Installing R and RStudio
17.2.2 Getting Started with R
17.2.2.1 Writing and Running Code
17.2.2.2 Installing and Loading Packages
17.3 Data Handling in R
17.3.1 Preparing the Data
17.3.2
17.3.3 Managing and Saving Data
17.4 Descriptive Statistics
17.4.1 Measures of Central Tendency
17.4.2 Measures of Dispersion
17.4.3 Coefficients of Correlation
17.5 Data Visualization
17.5.1 Barplots
17.5.2 Mosaic Plots
17.5.3 Histograms
17.5.4 Ecdf Plots
17.5.5 Boxplots
17.6 Conclusion
Further Reading
References
18 Cluster Analysis
18.1 Introduction
18.2 Fundamentals
18.2.1 Motivation
18.2.2 Data
18.2.3 Clustering
18.2.3.1 Cluster Definition
18.2.3.2 Proximity in Vector Space
18.2.3.3 Clustering Methods
18.2.3.4 Advanced Topics
18.3 Practical Guide with R
18.3.1 K-means
18.3.2 Hierarchical Clustering
18.3.3 Reporting Results
Further Reading
References
19 Multivariate Exploratory Approaches
19.1 Introduction
19.2 Fundamentals
19.2.1 Commonalities
19.2.2 Differences
19.2.3 Exploring is not Predicting
19.2.4 Correspondence Analysis
19.2.5 Multiple Correspondence Analysis
19.2.6 Principal Component Analysis
19.2.7 Exploratory Factor Analysis
19.3 Practical Guide with R
19.3.1 Correspondence Analysis
19.3.2 Multiple Correspondence Analysis
19.3.3 Principal Component Analysis
19.3.4 Exploratory Factor Analysis
19.3.5 Reporting Results
Further Reading
References
Part V Hypothesis-Testing
20 Classical Monofactorial (Parametric and Non-parametric) Tests
20.1 Introduction
20.2 Fundamentals
20.2.1 Null-Hypothesis Significance Testing (NHST) Paradigm
20.2.2 Statistical Tests and their Assumptions
20.2.2.1 Chi-Squared Test
20.2.2.2 T-test
20.2.2.3 ANOVA
20.2.2.4 Mann-Whitney U Test
20.2.2.5 Kruskal-Wallis Test
20.2.2.6 Pearson's Correlation
20.2.2.7 Non-parametric Correlation Tests
20.2.3 Effect Sizes and Confidence Intervals
20.3 Practical Guide with R
20.3.1 Chi-Squared Test
20.3.2 T-test
20.3.3 Cohen's d with 95% Confidence Intervals – To Be Computed with T-test
20.3.4 ANOVA
20.3.5 Post-hoc T-Test with Correction for Multiple Testing
20.3.6 Mann-Whitney U Test
20.3.7 Kruskal-Wallis Test
20.3.8 Pearson's and Spearman's Correlations
Further Reading
References
21 Fixed-Effects Regression Modeling
21.1 Introduction
21.2 Fundamentals
21.2.1 (Multiple) Linear Regression
21.2.1.1 An Example of (Multiple) Linear Regression
21.2.1.2 Assumptions of Linear Regression
21.2.2 Binary Logistic Regression
21.2.2.1 An Example of Binary Logistic Regression
21.2.2.2 Assumptions of Binary Logistic Regression
21.2.3
21.3 Practical Guide with R
21.3.1 Multiple Linear Regression
21.3.1.1 Creating an Artificial Dataset for a Multiple Linear Regression
21.3.1.2 Running a Multiple Linear Regression
21.3.1.3 What Happens If We Make the Effect Sizes Smaller?
21.3.1.4 What Happens If We Make the Effects “Noisier”?
21.3.1.5 Manufacturing an Interaction Effect
21.3.2 Binary Logistic Regression
21.3.2.1 Creating an Artificial Dataset for a Binary Logistic Regression
21.3.2.2 Running a Binary Logistic Regression
21.3.2.3 Visualizing the Effects of a Binomial Logistic Regression
21.3.2.4 Manufacturing and Visualizing an Interaction Effect
21.3.3 Reporting the Results of Regression Analyses
Further Reading
References
22 Mixed-Effects Regression Modeling
22.1 Introduction
22.2 Fundamentals
22.2.1 When Are Random Effects Useful?
22.2.1.1 Crossed and Nested Effects
22.2.1.2 Hierarchical/Multilevel Modeling
22.2.1.3 Random Slopes as Interactions
22.2.2 Model Specification and Modeling Assumptions
22.2.2.1 Simple Random Intercepts
22.2.2.2 Choosing Between Random and Fixed Effects
22.2.2.3 Model Quality
22.2.2.4 More Complex Models
Representative Study 1
Representative Study 2
22.3 Practical Guide with R
22.3.1 Specifying Models Using lme4 in R
22.3.1.1 Overview of the Data Set
22.3.1.2 A Simple Varying Intercept Instead of a Fixed Effect
22.3.1.3 More Complex Models
Further Reading
References
23 Generalized Additive Mixed Models
23.1 Introduction
23.2 Fundamentals
23.2.1 The Generalized Linear Model
23.2.2 The Generalized Additive Model
23.3 Practical Guide with R
23.3.1 A Main-Effects Model
23.3.2 A Model with Interactions
23.3.3 Random Effects in GAMs
23.3.4 Extensions of GAMs
Further Reading
References
24 Bootstrapping Techniques
24.1 Introduction
24.2 Fundamentals of Bootstrapping
24.2.1 Objectives and Methods
24.2.2 Applications of Bootstrapping in Corpus Linguistics
24.2.2.1 Estimating Sampling Distributions
24.2.2.2 Measuring Corpus Homogeneity
24.2.2.3 Validating Statistical Models
24.2.2.4 Random Forest Analysis
24.2.2.5 Additional Applications of Bootstrapping
24.3 Practical Guide with R
Further Reading
References
25 Conditional Inference Trees and Random Forests
25.1 Introduction
25.2 Fundamentals
25.2.1 Types of Data
25.2.2 The Assumptions
25.2.3 Research Questions
25.2.4 The Algorithms
25.2.4.1 The CIT Algorithm
25.2.4.2 The CRF Algorithm
25.2.5 CITs and CRFs Compared with Other Recursive Partitioning Methods
25.2.6 Situations When the Use of CITs and CRFs May Be Problematic
25.3 A Practical Guide with R
25.3.1 T/V Forms in Russian: Theoretical Background and Research Question
25.3.2 Data: Film Subtitles
25.3.3 Variables
25.3.4 Software
25.3.5 Conditional Inference Tree
25.3.6 Conditional Random Forest
25.3.7 Interpretation of the Predictor Effects: Partial Dependence Plots
25.3.8 Conclusions and Recommendations for Reporting the Results
Further Reading
References
Part VI Pulling Everything Together
26 Writing up a Corpus-Linguistic Paper
26.1 The Structure of an Empirical Paper
26.2 The ‘Methods’ Section
26.3 The ‘Results’ Section
26.4 Concluding Remarks
References
27 Meta-analyzing Corpus Linguistic Research
27.1 Introduction
27.2 Fundamentals
27.3 A Practical Guide to Meta-analysis with R
27.3.1 Defining the Domain and Searching for Primary Literature
27.3.2 Developing and Implementing a Coding Scheme
27.3.3 Aggregating Effect Sizes
27.3.4 Aggregating Effects and Interpreting Results
27.4 Critical Assessment and Future Directions
27.5 Conclusion
27.6 Tools and Resources
Further Reading
References