Foundation Models for Natural Language Processing: Pre-trained Language Models Integrating Media

This book provides a comprehensive overview of the state of the art in research and applications of Foundation Models and is intended for readers familiar with basic Natural Language Processing (NLP) concepts. In recent years, a revolutionary new paradigm has been developed for training models for NLP. These models are first pre-trained on large collections of text documents to acquire general syntactic knowledge and semantic information. Then, they are fine-tuned for specific tasks, which they can often solve with superhuman accuracy. When the models are large enough, they can be instructed by prompts to solve new tasks without any fine-tuning. Moreover, they can be applied to a wide range of different media and problem domains, ranging from image and video processing to robot control learning. Because they provide a blueprint for solving many tasks in artificial intelligence, they have been called Foundation Models.

After a brief introduction to basic NLP models, the main pre-trained language models BERT, GPT, and the sequence-to-sequence Transformer are described, as well as the concepts of self-attention and context-sensitive embeddings. Then, different approaches to improving these models are discussed, such as expanding the pre-training criteria, increasing the length of input texts, or including extra knowledge. An overview of the best-performing models for about twenty application areas is then presented, e.g., question answering, translation, story generation, dialog systems, and generating images from text. For each application area, the strengths and weaknesses of current models are discussed, and an outlook on further developments is given. In addition, links are provided to freely available program code. A concluding chapter summarizes the economic opportunities, mitigation of risks, and potential developments of AI.
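
The pre-training/fine-tuning paradigm described above can be made concrete with a short sketch. The following is a minimal example, assuming the freely available Hugging Face transformers library and PyTorch; the model name, the two-sentence sentiment dataset, and the hyperparameters are illustrative assumptions and do not come from the book.

    # Minimal sketch of the pre-train / fine-tune paradigm (assumed libraries:
    # Hugging Face "transformers" and PyTorch; data and settings are illustrative).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load a model that was already pre-trained on large text collections.
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # A tiny, purely illustrative labeled dataset for the downstream task (sentiment).
    texts = ["A wonderful, thoughtful book.", "Dull and poorly organized."]
    labels = torch.tensor([1, 0])

    # Fine-tuning: a few gradient steps adapt the pre-trained weights to the new task.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a handful of passes over the data is often enough here
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        outputs = model(**batch, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Prompting, by contrast, leaves the pre-trained weights unchanged: a sufficiently large model is given a task description and examples directly in its input text.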

Author(s): Gerhard Paaß, Sven Giesselbach
Publisher: Springer
Year: 2023

Language: English
Pages: 448

Foreword
Preface
Acknowledgments
Contents
About the Authors
1 Introduction
1.1 Scope of the Book
1.2 Preprocessing of Text
1.3 Vector Space Models and Document Classification
1.4 Nonlinear Classifiers
1.5 Generating Static Word Embeddings
1.6 Recurrent Neural Networks
1.7 Convolutional Neural Networks
1.8 Summary
References
2 Pre-trained Language Models
2.1 BERT: Self-Attention and Contextual Embeddings
2.1.1 BERT Input Embeddings and Self-Attention
Self-Attention to Generate Contextual Embeddings
2.1.2 Training BERT by Predicting Masked Tokens
2.1.3 Fine-Tuning BERT to Downstream Tasks
2.1.4 Visualizing Attentions and Embeddings
2.1.5 Natural Language Understanding by BERT
BERT's Performance on Other Fine-Tuning Tasks
2.1.6 Computational Complexity
2.1.7 Summary
2.2 GPT: Autoregressive Language Models
2.2.1 The Task of Autoregressive Language Models
2.2.2 Training GPT by Predicting the Next Token
Visualizing GPT Embeddings
2.2.3 Generating a Sequence of Words
2.2.4 The Advanced Language Model GPT-2
2.2.5 Fine-Tuning GPT
2.2.6 Summary
2.3 Transformer: Sequence-to-Sequence Translation
2.3.1 The Transformer Architecture
Cross-Attention
2.3.2 Decoding a Translation to Generate the Words
2.3.3 Evaluation of a Translation
2.3.4 Pre-trained Language Models and Foundation Models
Available Implementations
2.3.5 Summary
2.4 Training and Assessment of Pre-trained Language Models
2.4.1 Optimization of PLMs
Basics of PLM Optimization
Variants of Stochastic Gradient Descent
Parallel Training for Large Models
2.4.2 Regularization of Pre-trained Language Models
2.4.3 Neural Architecture Search
2.4.4 The Uncertainty of Model Predictions
Bayesian Neural Networks
Estimating Uncertainty by a Single Deterministic Model
Representing the Predictive Distribution by Ensembles
2.4.5 Explaining Model Predictions
Linear Local Approximations
Nonlinear Local Approximations
Explanation by Retrieval
Explanation by Generating a Chain of Thought
2.4.6 Summary
References
3 Improving Pre-trained Language Models
3.1 Modifying Pre-training Objectives
3.1.1 Autoencoders Similar to BERT
3.1.2 Autoregressive Language Models Similar to GPT
3.1.3 Transformer Encoder-Decoders
3.1.4 Systematic Comparison of Transformer Variants
3.1.5 Summary
3.2 Capturing Longer Dependencies
3.2.1 Sparse Attention Matrices
3.2.2 Hashing and Low-Rank Approximations
3.2.3 Comparisons of Transformers with Long Input Sequences
3.2.4 Summary
3.3 Multilingual Pre-trained Language Models
3.3.1 Autoencoder Models
3.3.2 Seq2seq Transformer Models
3.3.3 Autoregressive Language Models
3.3.4 Summary
3.4 Additional Knowledge for Pre-trained Language Models
3.4.1 Exploiting Knowledge Base Embeddings
3.4.2 Pre-trained Language Models for Graph Learning
3.4.3 Textual Encoding of Tables
3.4.4 Textual Encoding of Knowledge Base Relations
3.4.5 Enhancing Pre-trained Language Models by Retrieved Texts
3.4.6 Summary
3.5 Changing Model Size
3.5.1 Larger Models Usually Have a Better Performance
3.5.2 Mixture-of-Experts Models
3.5.3 Parameter Compression and Reduction
3.5.4 Low-Rank Factorization
3.5.5 Knowledge Distillation
3.5.6 Summary
3.6 Fine-Tuning for Specific Applications
3.6.1 Properties of Fine-Tuning
Catastrophic Forgetting
Fine-Tuning and Overfitting
3.6.2 Fine-Tuning Variants
Fine-Tuning in Two Stages
Fine-Tuning for Multiple Tasks
Meta-Learning to Accelerate Fine-Tuning
Fine-Tuning a Frozen Model by Adapters
Fine-Tuning GPT-3
3.6.3 Creating Few-Shot Prompts
3.6.4 Thought Chains for Few-Shot Learning of Reasoning
3.6.5 Fine-Tuning Models to Execute Instructions
InstructGPT Results
Instruction Tuning with FLAN
3.6.6 Generating Labeled Data by Foundation Models
3.6.7 Summary
References
4 Knowledge Acquired by Foundation Models
4.1 Benchmark Collections
4.1.1 The GLUE Benchmark Collection
4.1.2 SuperGLUE: An Advanced Version of GLUE
4.1.3 Text Completion Benchmarks
4.1.4 Large Benchmark Collections
4.1.5 Summary
4.2 Evaluating Knowledge by Probing Classifiers
4.2.1 BERT's Syntactic Knowledge
4.2.2 Common Sense Knowledge
4.2.3 Logical Consistency
Improving Logical Consistency
4.2.4 Summary
4.3 Transferability and Reproducibility of Benchmarks
4.3.1 Transferability of Benchmark Results
Benchmarks May Not Test All Aspects
Logical Reasoning by Correlation
4.3.2 Reproducibility of Published Results in Natural Language Processing
Available Implementations
4.3.3 Summary
References
5 Foundation Models for Information Extraction
5.1 Text Classification
5.1.1 Multiclass Classification with Exclusive Classes
5.1.2 Multilabel Classification
5.1.3 Few- and Zero-Shot Classification
Available Implementations
5.1.4 Summary
5.2 Word Sense Disambiguation
5.2.1 Sense Inventories
5.2.2 Models
Available Implementations
5.2.3 Summary
5.3 Named Entity Recognition
5.3.1 Flat Named Entity Recognition
5.3.2 Nested Named Entity Recognition
Available Implementations
5.3.3 Entity Linking
Available Implementations
5.3.4 Summary
5.4 Relation Extraction
5.4.1 Coreference Resolution
Available Implementations
5.4.2 Sentence-Level Relation Extraction
5.4.3 Document-Level Relation Extraction
5.4.4 Joint Entity and Relation Extraction
Aspect-Based Sentiment Analysis
Semantic Role Labeling
Extracting Knowledge Graphs from PLMs
5.4.5 Distant Supervision
5.4.6 Relation Extraction Using Layout Information
Available Implementations
5.4.7 Summary
References
6 Foundation Models for Text Generation
6.1 Document Retrieval
6.1.1 Dense Retrieval
6.1.2 Measuring Text Retrieval Performance
6.1.3 Cross-Encoders with BERT
6.1.4 Using Token Embeddings for Retrieval
6.1.5 Dense Passage Embeddings and Nearest Neighbor Search
Available Implementations
6.1.6 Summary
6.2 Question Answering
6.2.1 Question Answering Based on Training Data Knowledge
Fine-Tuned Question Answering Models
Question Answering with Few-Shot Language Models
6.2.2 Question Answering Based on Retrieval
6.2.3 Long-Form Question Answering Using Retrieval
A Language Model with Integrated Retrieval
Controlling a Search Engine by a Pre-trained Language Model
Available Implementations
6.2.4 Summary
6.3 Neural Machine Translation
6.3.1 Translation for a Single Language Pair
6.3.2 Multilingual Translation
6.3.3 Multilingual Question Answering
Available Implementations
6.3.4 Summary
6.4 Text Summarization
6.4.1 Shorter Documents
6.4.2 Longer Documents
6.4.3 Multi-Document Summarization
Available Implementations
6.4.4 Summary
6.5 Text Generation
6.5.1 Generating Text by Language Models
6.5.2 Generating Text with a Given Style
Style-Conditional Probabilities
Prompt-Based Generation
6.5.3 Transferring a Document to Another Text Style
Style Transfer with Parallel Data
Style Transfer without Parallel Data
Style Transfer with Few-Shot Prompts
6.5.4 Story Generation with a Given Plot
Specify a Storyline by Keywords or Phrases
Specify a Storyline by Sentences
Other Control Strategies
6.5.5 Generating Fake News
Detecting Fake News
6.5.6 Generating Computer Code
Available Implementations
6.5.7 Summary
6.6 Dialog Systems
6.6.1 Dialog Models as a Pipeline of Modules
6.6.2 Advanced Dialog Models
6.6.3 LaMDA and BlenderBot 3 Using Retrieval and Filters
6.6.4 Limitations and Remedies of Dialog Systems
Available Implementations
6.6.5 Summary
References
7 Foundation Models for Speech, Images, Videos, and Control
7.1 Speech Recognition and Generation
7.1.1 Basics of Automatic Speech Recognition
7.1.2 Transformer-Based Speech Recognition
7.1.3 Self-supervised Learning for Speech Recognition
Available Implementations
7.1.4 Text-to-Speech
Available Implementations
7.1.5 Speech-to-Speech Language Model
7.1.6 Music Generation
Available Implementations
7.1.7 Summary
7.2 Image Processing and Generation
7.2.1 Basics of Image Processing
7.2.2 Vision Transformer
7.2.3 Image Generation
7.2.4 Joint Processing of Text and Images
7.2.5 Describing Images by Text
7.2.6 Generating Images from Text
7.2.7 Diffusion Models Restore an Image Destructed by Noise
7.2.8 Multipurpose Models
Available Implementations
7.2.9 Summary
7.3 Video Interpretation and Generation
7.3.1 Basics of Video Processing
7.3.2 Video Captioning
7.3.3 Action Recognition in Videos
7.3.4 Generating Videos from Text
Available Implementations
7.3.5 Summary
7.4 Controlling Dynamic Systems
7.4.1 The Decision Transformer
7.4.2 The GATO Model for Text, Images and Control
Available Implementations
7.4.3 Summary
7.5 Interpretation of DNA and Protein Sequences
7.5.1 Summary
References
8 Summary and Outlook
8.1 Foundation Models Are a New Paradigm
8.1.1 Pre-trained Language Models
8.1.2 Jointly Processing Different Modalities by Foundation Models
8.1.3 Performance Level of Foundation Models
Capturing Knowledge Covered by Large Text Collections
Information Extraction
Text Processing and Text Generation
Multimedia Processing
8.1.4 Promising Economic Solutions
8.2 Potential Harm from Foundation Models
8.2.1 Unintentionally Generate Biased or False Statements
Accidentally Generated False or Misleading Information
Reducing Bias by Retrieval
Filtering Biased Text
8.2.2 Intentional Harm Caused by Foundation Models
Fake Images Created by Foundation Models
Surveillance and Censorship
8.2.3 Overreliance or Treating a Foundation Model as Human
8.2.4 Disclosure of Private Information
8.2.5 Society, Access, and Environmental Harms
Access to Foundation Models
Energy Consumption of Foundation Models
Foundation Models Can Cause Unemployment and Social Inequality
Foundation Models Can Promote a Uniform World View and Culture
A Legal Regulation of Foundation Models Is Necessary
8.3 Advanced Artificial Intelligence Systems
8.3.1 Can Foundation Models Generate Innovative Content?
8.3.2 Grounding Language in the World
8.3.3 Fast and Slow Thinking
8.3.4 Planning Strategies
References
Appendix A
A.1 Sources and Copyright of Images Used in Graphics
Index