The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence.
Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms. This second edition has been significantly expanded and updated, presenting new topics and updating coverage of other topics.
Like the first edition, this second edition focuses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case for which exact solutions can be found. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and Double Learning. Part II extends these ideas to function approximation, with new sections on such topics as artificial neural networks and the Fourier basis, and offers expanded treatment of off-policy learning and policy-gradient methods. Part III has new chapters on reinforcement learning's relationships to psychology and neuroscience, as well as an updated case-studies chapter including AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson's wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.
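To give a concrete flavor of the core online, tabular algorithms that Part I covers, below is a minimal Q-learning sketch on a made-up four-state chain environment. The environment, step size, discount, exploration rate, and episode count are illustrative assumptions for this listing, not material taken from the book.

    # Minimal tabular Q-learning sketch on a hypothetical 4-state chain MDP.
    # Reaching the right end of the chain yields reward +1 and ends the episode.
    import random

    N_STATES, ACTIONS = 4, (0, 1)      # states 0..3; action 0 = left, 1 = right
    TERMINAL = N_STATES - 1

    def step(state, action):
        """Deterministic chain dynamics; +1 reward only on reaching the terminal state."""
        nxt = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if nxt == TERMINAL else 0.0
        return nxt, reward, nxt == TERMINAL

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, epsilon = 0.1, 0.9, 0.1   # arbitrary illustrative settings

    for _ in range(500):                     # episodes
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy
            a = random.choice(ACTIONS) if random.random() < epsilon \
                else max(ACTIONS, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
            s = s2

    print({k: round(v, 2) for k, v in Q.items()})

Running the sketch shows the action values for moving right converging toward the discounted return of reaching the goal, which is the kind of exact tabular solution Part I is concerned with before Part II introduces function approximation.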
Author(s): Richard S. Sutton; Andrew G. Barto
Series: Adaptive Computation and Machine Learning
Edition: 2
Publisher: MIT Press
Year: 2020
Language: English
City: Cambridge, Massachusetts
Tags: AI; artificial intelligence; machine learning; ML
Preface to the Second Edition
Preface to the First Edition
Summary of Notation
Introduction
Reinforcement Learning
Examples
Elements of Reinforcement Learning
Limitations and Scope
An Extended Example: Tic-Tac-Toe
Summary
Early History of Reinforcement Learning
I Tabular Solution Methods
Multi-armed Bandits
A k-armed Bandit Problem
Action-value Methods
The 10-armed Testbed
Incremental Implementation
Tracking a Nonstationary Problem
Optimistic Initial Values
Upper-Confidence-Bound Action Selection
Gradient Bandit Algorithms
Associative Search (Contextual Bandits)
Summary
Finite Markov Decision Processes
The Agent–Environment Interface
Goals and Rewards
Returns and Episodes
Unified Notation for Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Optimal Value Functions
Optimality and Approximation
Summary
Dynamic Programming
Policy Evaluation (Prediction)
Policy Improvement
Policy Iteration
Value Iteration
Asynchronous Dynamic Programming
Generalized Policy Iteration
Efficiency of Dynamic Programming
Summary
Monte Carlo Methods
Monte Carlo Prediction
Monte Carlo Estimation of Action Values
Monte Carlo Control
Monte Carlo Control without Exploring Starts
Off-policy Prediction via Importance Sampling
Incremental Implementation
Off-policy Monte Carlo Control
*Discounting-aware Importance Sampling
*Per-decision Importance Sampling
Summary
Temporal-Difference Learning
TD Prediction
Advantages of TD Prediction Methods
Optimality of TD(0)
Sarsa: On-policy TD Control
Q-learning: Off-policy TD Control
Expected Sarsa
Maximization Bias and Double Learning
Games, Afterstates, and Other Special Cases
Summary
n-step Bootstrapping
n-step TD Prediction
n-step Sarsa
n-step Off-policy Learning
*Per-decision Methods with Control Variates
Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
*A Unifying Algorithm: n-step Q(σ)
Summary
Planning and Learning with Tabular Methods
Models and Planning
Dyna: Integrated Planning, Acting, and Learning
When the Model Is Wrong
Prioritized Sweeping
Expected vs. Sample Updates
Trajectory Sampling
Real-time Dynamic Programming
Planning at Decision Time
Heuristic Search
Rollout Algorithms
Monte Carlo Tree Search
Summary of the Chapter
Summary of Part I: Dimensions
II Approximate Solution Methods
On-policy Prediction with Approximation
Value-function Approximation
The Prediction Objective (VE)
Stochastic-gradient and Semi-gradient Methods
Linear Methods
Feature Construction for Linear Methods
Polynomials
Fourier Basis
Coarse Coding
Tile Coding
Radial Basis Functions
Selecting Step-Size Parameters Manually
Nonlinear Function Approximation: Artificial Neural Networks
Least-Squares TD
Memory-based Function Approximation
Kernel-based Function Approximation
Looking Deeper at On-policy Learning: Interest and Emphasis
Summary
On-policy Control with Approximation
Episodic Semi-gradient Control
Semi-gradient n-step Sarsa
Average Reward: A New Problem Setting for Continuing Tasks
Deprecating the Discounted Setting
Differential Semi-gradient n-step Sarsa
Summary
*Off-policy Methods with Approximation
Semi-gradient Methods
Examples of Off-policy Divergence
The Deadly Triad
Linear Value-function Geometry
Gradient Descent in the Bellman Error
The Bellman Error is Not Learnable
Gradient-TD Methods
Emphatic-TD Methods
Reducing Variance
Summary
Eligibility Traces
The λ-return
TD(λ)
n-step Truncated λ-return Methods
Redoing Updates: Online λ-return Algorithm
True Online TD(λ)
*Dutch Traces in Monte Carlo Learning
Sarsa(λ)
Variable λ and γ
Off-policy Traces with Control Variates
Watkins's Q(λ) to Tree-Backup(λ)
Stable Off-policy Methods with Traces
Implementation Issues
Conclusions
Policy Gradient Methods
Policy Approximation and its Advantages
The Policy Gradient Theorem
REINFORCE: Monte Carlo Policy Gradient
REINFORCE with Baseline
Actor–Critic Methods
Policy Gradient for Continuing Problems
Policy Parameterization for Continuous Actions
Summary
III Looking Deeper
Psychology
Prediction and Control
Classical Conditioning
Blocking and Higher-order Conditioning
The Rescorla–Wagner Model
The TD Model
TD Model Simulations
Instrumental Conditioning
Delayed Reinforcement
Cognitive Maps
Habitual and Goal-directed Behavior
Summary
Neuroscience
Neuroscience Basics
Reward Signals, Reinforcement Signals, Values, and Prediction Errors
The Reward Prediction Error Hypothesis
Dopamine
Experimental Support for the Reward Prediction Error Hypothesis
TD Error/Dopamine Correspondence
Neural Actor–Critic
Actor and Critic Learning Rules
Hedonistic Neurons
Collective Reinforcement Learning
Model-based Methods in the Brain
Addiction
Summary
Applications and Case Studies
TD-Gammon
Samuel's Checkers Player
Watson's Daily-Double Wagering
Optimizing Memory Control
Human-level Video Game Play
Mastering the Game of Go
AlphaGo
AlphaGo Zero
Personalized Web Services
Thermal Soaring
Frontiers
General Value Functions and Auxiliary Tasks
Temporal Abstraction via Options
Observations and State
Designing Reward Signals
Remaining Issues
Reinforcement Learning and the Future of Artificial Intelligence
References
Index