Text is everywhere, and it is a fantastic resource for social scientists. However, because it is so abundant, and because language is so variable, it is often difficult to extract the information we want. A whole subfield of AI is concerned with text analysis (natural language processing). Many of the basic analysis methods developed are now readily available as Python implementations. This Element will teach you when to use which method, the mathematical background of how it works, and the Python code to implement it.
Author(s): Dirk Hovy
Publisher: Cambridge University Press
Year: 2020
Language: English
City: Cambridge
Cover
Title page
Copyright page
Text Analysis in Python for Social Scientists
Contents
Introduction
Background
1 Prerequisites
2 What's in a Word
2.1 Word Descriptors
2.1.1 Tokens and Splitting
2.1.2 Lemmatization
2.1.3 Stemming
2.1.4 n-Grams
2.2 Parts of Speech
2.3 Stopwords
2.4 Named Entities
2.5 Syntax
2.6 Caveats – What If It’s Not English?
3 Regular Expressions
4 Pointwise Mutual Information
5 Representing Text
5.1 Enter the Matrix
5.2 Discrete Representations
5.2.1 n-gram Features
5.2.2 Syntactic Features
5.2.3 TF-IDF Counts
5.2.4 Dictionary Methods
5.3 Distributed Representations
5.3.1 Cosine Similarity
5.3.2 Word Embeddings
5.3.3 Document Embeddings
5.4 Discrete versus Continuous
Exploration: Finding Structure in the Data
6 Matrix Factorization
6.1 Dimensionality Reduction
6.1.1 Singular Value Decomposition
6.1.2 Nonnegative Matrix Factorization
6.2 Visualization
6.2.1 t-SNE
6.3 Word Descriptors
6.4 Comparison
7 Clustering
7.1 k-Means
7.2 Agglomerative Clustering
7.3 Comparison
7.4 Choosing the Number of Clusters
7.5 Evaluation
8 Language Models
8.1 The Markov Assumption
8.2 Trigram LMs
8.3 Maximum Likelihood Estimation (MLE)
8.4 Probability of a Sentence
8.5 Smoothing
8.6 Generation
9 Topic Models
9.1 Caveats
9.2 Implementation
9.3 Selection and Evaluation
9.4 Adding Structure
9.5 Adding Constraints
9.6 Topics versus Clusters
Appendix A: English Stopwords
Appendix B: Probabilities
B1 Joint Probability
B2 Conditional Probability
B3 Probability Distributions
B3.1 Uniform Distribution
B3.2 Bernoulli Distribution
B3.3 Multinomial Distribution
B3.4 Dirichlet Distribution
References
Acknowledgments