A high school student can create deep Q-learning code to control her robot, without any understanding of the meaning of 'deep' or 'Q', or why the code sometimes fails. This book is designed to explain the science behind reinforcement learning and optimal control in a way that is accessible to students with a background in calculus and matrix algebra.
Author(s): Sean Meyn
Publisher: Cambridge University Press
Year: 2022
Language: English
Pages: 454
Control Systems and Reinforcement Learning
Acknowledgments
Introduction
Part I
Control Crash Course
2.1.1 Failure of an Adaptive Control System
2.3.1 Sufficient Statistics and the Nonlinear State Space Model
2.3.2 State Augmentation and Learning
2.3.3 Linear State Space Model
2.3.4 A Nod to Newton and Leibniz
2.4.1 Total Cost
2.4.2 Stability of Equilibria
2.4.3 Lyapunov Functions
2.4.4 Technical Proofs
2.4.5 Geometry in Continuous Time
2.4.6 Linear State Space Models
2.5.1 Actors and Critics
2.5.2 Temporal Differences
2.5.3 Bandits and Exploration
2.7.1 Wall Street
2.7.2 Mountain Car
2.7.3 MagBall
2.7.4 CartPole
2.7.5 Pendubot and Acrobot
2.7.6 Cooperative Rowing
Optimal Control
3.2.1 Value Iteration
3.2.2 Policy Improvement
3.2.3 Perron-Frobenius Theory: A Gentle Introduction*
3.3.1 Discounted Cost
3.3.2 Shortest Path Problem
3.3.3 Finite Horizon
3.3.4 Model Predictive Control
3.9.1 Mountain Car
3.9.2 Spiders and Flies
3.9.3 Contention for Resources and Instability
3.9.4 Solving the HJB Equation
ODE Methods for Algorithm Design
4.2.1 Euler Approximation for a Linear System
4.4.1 Role of Convexity
4.4.2 Polyak-Łojasiewicz Condition
4.4.3 Euler Approximation
4.4.4 Constrained Optimization
4.5.1 Quasi Monte-Carlo
4.5.2 System Identification
4.5.3 Approximate Policy Improvement
4.5.4 A Brief Tour of QSA Theory
4.5.5 Constant Gain Algorithm
4.5.6 Zap QSA
4.6.1 Simulated Annealing
4.6.2 A Menu of Algorithms
4.7.1 Mountain Car
4.7.2 LQR
4.7.3 What about High Dimensions?
4.8.1 Gronwall’s Inequality
4.8.2 Lyapunov Functions
4.8.3 Gradient Flows
4.8.4 The ODE at ∞
4.9.1 Main Results and Some Insights
4.9.2 ODE Solidarity
4.9.3 Criteria for Stability
4.9.4 Deterministic Markovian Model
4.9.5 Convergence Rates
4.11.1 ODE Methods for Algorithm Design
4.11.2 Optimization
4.11.3 QSA
4.11.4 SGD and Extremum Seeking Control
Value Function Approximations
5.1.1 Function Approximation Based on Training Data
5.1.2 Linear Function Approximation
5.1.3 Neural Networks
5.1.4 Kernels
5.1.5 Are We Done Yet?
5.3.1 Fixed-Policy Temporal Difference
5.3.2 Least Squares and Linear Regression
5.3.3 Recursive LSTD and Zap
5.4.1 Galerkin Relaxations and Projection
5.4.2 TD(λ)-Learning
5.4.3 Projected Bellman Operator and Q-Learning
5.4.4 GQ-Learning
5.4.5 Batch Methods and DQN
5.5.1 Convex Q-Learning with a Finite-Dimensional Function Class
5.5.2 BCQL and Kernel Methods
5.9.1 Machine Learning
5.9.2 TD-Learning
5.9.3 Q-Learning
Part II
Markov Chains
6.1.1 Notation and Conventions
6.3.1 Example
6.4.1 Critic Methods
6.4.2 Actor Methods
6.6.1 Average Cost
6.6.2 Discounted Cost
6.7.1 Asymptotic Statistics Made Finite
6.7.2 Asymptotic Variance and Mixing Time
6.7.3 Sample Complexity
6.7.4 A Simple Example?
6.7.5 Combating Variance by Design
6.9.1 Classification
6.9.2 Lyapunov Theory
Stochastic Control
7.1.1 Admissible Inputs
7.3.1 Fluid Models and Value Functions
7.4.1 Fluid Model
7.4.2 Computation and Solidarity
7.4.3 Solidarity
7.5.1 Fluid Model Dynamics
7.5.2 DP Equations
7.5.3 Partial Observations
7.7.1 Control with Partial Observations
7.8.1 Bandit Models
7.8.2 Bayesian Bandits
7.8.3 Naive Optimism Can Be Successful
Stochastic Approximation
8.2.1 ODE Design
8.2.2 ODE Approximation
8.2.3 Choice of Step-Size
8.2.4 Multiple Time Scales
8.2.5 Algorithm Performance
8.2.6 Asymptotic versus Transient Performance
8.3.1 Monte Carlo
8.3.2 Stochastic Gradient Descent
8.3.3 Empirical Risk Minimization
8.4.1 Gain Selection
8.4.2 Variance Formulae
8.4.3 Simulations
8.5.1 Approximating the Newton-Raphson Flow
8.5.2 Zap Zero
8.5.3 Stochastic Newton-Raphson
8.6.1 Curse of Condition Number
8.6.2 Curse of Markovian Memory
8.7.1 Stability and Convergence
8.7.2 Linearization and Convergence Rates
8.7.3 Polyak-Ruppert Averaging
8.9.1 SA and RL
8.9.2 Stability
8.9.3 Asymptotic Statistics
8.9.4 Less Asymptotic Statistics
Temporal Difference Methods
9.1.1 Fixed-Policy Value Functions and DP Equations
9.1.2 PIA and the Q-Function
PIA for Discounted Cost
9.1.3 Advantage Function
9.2.1 Conditional Expectation and Projection
9.2.2 Linear Independence
9.3.1 Mean-Square Bellman Error
9.3.2 Mean-Square Value Function Error
9.3.3 Projected Bellman Error
9.4.1 Linear Function Class
9.4.2 Nonlinear Parameterizations
9.5.1 Exploration
9.5.2 On and Off Algorithms
9.5.3 Relative TD(λ)
9.5.4 TD(λ) for Advantage
9.6.1 Optimal Control Essentials
9.6.2 Watkins’s Algorithms
9.6.3 Exploration
9.6.4 ODE Analysis
9.6.5 Variance Matters
9.7.1 Gain Selection
9.7.2 Honest Conclusions
9.8.1 GQ-Learning
9.8.2 Zap Q-Learning
9.9.1 Advantage Function
9.9.2 TD Stability Theory
9.11.1 Temporal Difference Methods
9.11.2 Q-Learning
9.11.3 GQ and Zap
9.11.4 Convex Q-Learning
Setting the Stage, Return of the Actors
10.1.1 Linear Operators and Adjoints
10.1.2 Adjoints and Eligibility Vectors
10.1.3 Weighted Norms and Weighted Eligibility
10.2.1 Projection of Advantage and Value
10.2.2 Weighted Norm
10.4.1 Every Other Criterion
10.4.2 Average-Cost Algorithms
10.5.1 Actor-Critic for Average Cost
10.5.2 A Few Warnings and Remedies
10.7.1 Variance Reduction through Advantage
10.7.2 A Better Advantage
10.10.1 Adjoints and TD-Learning
10.10.2 Actor-Critic Methods
10.10.3 Some Ancient History
10.10.4 Fisher Information
Appendix A
Mathematical Background
A.1.1 Generalities
A.1.2 Topology
A.1.3 Linear Algebra
A.2.1 Events and Sample Space
A.2.2 Markov Chain Basics
A.2.3 Strong Markov Property
Appendix B
Markov Decision Processes
B.1.1 Total Cost in Many Flavors
B.2.1 Value Iteration and Successive Approximation
B.2.2 Policy Improvement and Newton-Raphson
B.2.3 LP Formulations
Appendix C
Partial Observations and Belief States