Reinforcement Learning for Sequential Decision and Optimal Control

As one of the most important branches of Artificial Intelligence (AI), Reinforcement Learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal control that promises to deliver optimal solutions for decision-making and control in large-scale, complex dynamic processes.

Have you ever wondered how AlphaZero learns to defeat the top human Go players? Do you have any clue how an autonomous driving system can gradually develop driving skills beyond those of normal drivers? What is the key that enables AlphaStar to make decisions in StarCraft, a notoriously difficult strategy game with partial information and complex rules? The core mechanism underlying these recent technical breakthroughs is Reinforcement Learning, a theory that helps an agent develop its self-evolution ability through continual interaction with the environment. In the past few years, the Artificial Intelligence community has witnessed phenomenal successes of RL in various fields, including chess games, computer games, and robotic control. RL is also considered a promising and powerful tool for creating general Artificial Intelligence in the future. As an interdisciplinary field of trial-and-error learning and optimal control, RL resembles how humans reinforce their intelligence by interacting with the environment, and it provides a principled solution for sequential decision making and optimal control in large-scale, complex problems.

Since RL involves a wide range of new concepts and theories, scholars may be puzzled by a number of questions: What is the inherent mechanism of reinforcement learning? What is the internal connection between RL and optimal control? How has RL evolved in the past few decades, and what are its milestones? How do we choose and implement practical and effective RL algorithms for real-world scenarios? What are the key challenges that RL faces today, and how can we solve them? What are the current trends of RL research? You can find answers to all of these questions in this book.

The purpose of the book is to help researchers and practitioners take a comprehensive view of RL and understand the in-depth connection between RL and optimal control. It offers not only systematic and thorough explanations of the theoretical basics but also methodical guidance on practical algorithm implementation. The book covers both classic theories and recent achievements, and its content is carefully and logically organized: basic topics such as the main concepts and terminology of RL, the Markov decision process (MDP), Bellman's optimality condition, Monte Carlo learning, temporal difference learning, stochastic dynamic programming, function approximation, policy gradient methods, approximate dynamic programming, and deep RL, as well as the latest advances in action and state constraints, safety guarantees, reference harmonization, robust RL, partially observable MDPs, multi-agent RL, inverse RL, offline RL, and so on.
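
The description above repeatedly points to trial-and-error learning through environment interaction. The sketch below is a minimal, self-contained illustration of that loop using tabular Q-learning, one of the temporal-difference methods covered in Chapter 4 of the book. The toy environment, its reset/step interface, and all constants here are assumptions made purely for illustration and are not taken from the book.

```python
# Minimal sketch of trial-and-error RL: tabular Q-learning on a toy chain
# environment (hypothetical example, not from the book).
import random

N_STATES, N_ACTIONS = 5, 2      # toy sizes, chosen arbitrarily
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def reset():
    return 0                    # every episode starts in state 0

def step(state, action):
    # Toy dynamics: action 1 moves one step right, action 0 stays put;
    # reaching the last state yields reward 1 and ends the episode.
    next_state = min(state + action, N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for episode in range(200):
    s, done = reset(), False
    while not done:
        # Epsilon-greedy exploration: try actions, observe rewards,
        # and gradually reinforce the better ones.
        if random.random() < EPSILON:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s_next, r, done = step(s, a)
        # Temporal-difference update toward the Bellman optimality target.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

print("Learned Q-table:", Q)
```

Even in this tiny setting, the agent starts with no knowledge of the dynamics and converges to a policy that moves right toward the goal, which is the "self-evolution through interaction" idea that the book develops rigorously.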

Author(s): Shengbo Eben Li
Publisher: Springer
Year: 2023

Language: English
Pages: 484

About the Author
Preface
Symbols
Abbreviations
Contents
Chapter 1. Introduction to Reinforcement Learning
1.1 History of RL
1.1.1 Dynamic Programming
1.1.2 Trial-and-Error Learning
1.2 Examples of RL Applications
1.2.1 Tic-Tac-Toe
1.2.2 Chinese Go
1.2.3 Autonomous Vehicles
1.3 Key Challenges in Today’s RL
1.3.1 Exploration-Exploitation Dilemma
1.3.2 Uncertainty and Partial Observability
1.3.3 Temporally Delayed Reward
1.3.4 Infeasibility from Safety Constraint
1.3.5 Non-stationary Environment
1.3.6 Lack of Generalizability
1.4 References
Chapter 2. Principles of RL Problems
2.1 Four Elements of RL Problems
2.1.1 Environment Model
2.1.2 State-Action Sample
2.1.3 Policy
2.1.3.1 Tabular Policy
2.1.3.2 Parameterized Policy
2.1.4 Reward Signal
2.1.4.1 Long-term Return
2.1.4.2 Value Function
2.1.4.3 Self-consistency Condition
2.2 Classification of RL Methods
2.2.1 Definition of RL Problems
2.2.2 Bellman’s Principle of Optimality
2.2.2.1 Optimal Value and Optimal Policy
2.2.2.2 Two Kinds of Bellman Equation
2.2.3 Indirect RL Methods
2.2.3.1 Policy Iteration
2.2.3.2 Value Iteration
2.2.4 Direct RL Methods
2.3 A Broad View of RL
2.3.1 Influence of Initial State Distribution
2.3.1.1 When π*(s) is Inside Policy Set Π
2.3.2 Differences between RL and MPC
2.3.3 Various Combination of Four Elements
2.4 Measures of Learning Performance
2.4.1 Policy Performance
2.4.2 Learning Accuracy
2.4.3 Learning Speed
2.4.4 Sample Efficiency
2.4.5 Approximation Accuracy
2.5 Two Examples of Markov Decision Processes
2.5.1 Example: Indoor Cleaning Robot
2.5.2 Example: Autonomous Driving System
2.6 References
Chapter 3. Model-Free Indirect RL: Monte Carlo
3.1 MC Policy Evaluation
3.2 MC Policy Improvement
3.2.1 Greedy Policy
3.2.2 Policy Improvement Theorem
3.2.3 MC Policy Selection
3.3 On-Policy Strategy vs. Off-Policy Strategy
3.3.1 On-Policy Strategy
3.3.2 Off-Policy Strategy
3.3.2.1 Importance Sampling Transformation
3.3.2.2 Importance Sampling Ratio for State-Values
3.3.2.3 Importance Sampling Ratio for Action-Values
3.4 Understanding Monte Carlo RL from a Broad Viewpoint
3.4.1 On-Policy MC Learning Algorithm
3.4.2 Off-Policy MC Learning Algorithm
3.4.3 Incremental Estimation of Value Function
3.5 Example of Monte Carlo RL
3.5.1 Cleaning Robot in a Grid World
3.5.2 MC with Action-Value Function
3.5.3 Influences of Key Parameters
3.6 References
Chapter 4. Model-Free Indirect RL: Temporal Difference
4.1 TD Policy Evaluation
4.2 TD Policy Improvement
4.2.1 On-Policy Strategy
4.2.2 Off-Policy Strategy
4.3 Typical TD Learning Algorithms
4.3.1 On-Policy TD: SARSA
4.3.2 Off-Policy TD: Q-Learning
4.3.3 Off-Policy TD: Expected SARSA
4.3.4 Recursive Value Initialization
4.4 Unified View of TD and MC
4.4.1 n-Step TD Policy Evaluation
4.4.2 TD-Lambda Policy Evaluation
4.5 Examples of Temporal Difference
4.5.1 Results of SARSA
4.5.2 Results of Q-Learning
4.5.3 Comparison of MC, SARSA, and Q-Learning
4.6 References
Chapter 5. Model-Based Indirect RL: Dynamic Programming
5.1 Stochastic Sequential Decision
5.1.1 Model for Stochastic Environment
5.1.2 Average Cost vs. Discounted Cost
5.1.2.1 Average Cost
5.1.2.2 Discounted Cost
5.1.3 Policy Iteration vs. Value Iteration
5.1.3.1 Concept of Policy Iteration
5.1.3.2 Concept of Value Iteration
5.2 Policy Iteration Algorithm
5.2.1 Policy Evaluation (PEV)
5.2.1.1 Iterative PEV Algorithm
5.2.1.2 Fixed-Point Explanation
5.2.2 Policy Improvement (PIM)
5.2.3 Proof of Convergence
5.2.3.1 Convergence of Iterative PEV
5.2.3.2 Convergence of DP Policy Iteration
5.2.4 Explanation with Newton-Raphson Mechanism
5.3 Value Iteration Algorithm
5.3.1 Explanation with Fixed-point Iteration Mechanism
5.3.2 Convergence of DP Value Iteration
5.3.3 Value Iteration for Problems with Average Costs
5.3.3.1 Finite-Horizon Value Iteration
5.3.3.2 Relative Value Iteration
5.3.3.3 Vanishing Discount Factor Approach
5.4 Stochastic Linear Quadratic Control
5.4.1 Average Cost LQ Control
5.4.2 Discounted Cost LQ Control
5.4.3 Performance Comparison with Simulations
5.5 Additional Viewpoints about DP
5.5.1 Unification of Policy Iteration and Value Iteration
5.5.2 Unification of Model-Based and Model-Free RLs
5.5.3 PEV with Other Fixed-point Iterations
5.5.3.1 Model-Based Policy Evaluation
5.5.3.2 Model-Free Policy Evaluation
5.5.4 Value Iteration with Other Fixed-Point Iterations
5.5.4.1 Model-Based Value Iteration
5.5.4.2 Model-Free Value Iteration
5.5.5 Exact DP with Backward Recursion
5.6 More Definitions of Better Policy and Its Convergence
5.6.1 Penalizing a Greedy Search with Policy Entropy
5.6.2 Expectation-Based Definition for Better Policy
5.6.3 Another Form of Expectation-Based Condition
5.7 Example: Autonomous Car on a Grid Road
5.7.1 Results of DP
5.7.2 Influences of Key Parameters on DP
5.7.3 Influences of Key Parameters on SARSA and Q-learning
5.8 Appendix: Fixed-point Iteration Theory
5.8.1 Procedures of Fixed-Point Iteration
5.8.2 Fixed-Point Iteration for Linear Equations
5.9 References
Chapter 6. Indirect RL with Function Approximation
6.1 Linear Approximation and Basis Functions
6.1.1 Binary Basis Function
6.1.2 Polynomial Basis Function
6.1.3 Fourier Basis Function
6.1.4 Radial Basis Function
6.2 Parameterization of Value Function and Policy
6.2.1 Parameterized Value Function
6.2.2 Parameterized Policy
6.2.3 Choice of State and Action Representation
6.3 Value Function Approximation
6.3.1 On-Policy Value Approximation
6.3.1.1 True-gradient in MC and Semi-gradient in TD
6.3.1.2 Sample-based estimation of value gradient
6.3.1.3 Other designs for value function approximation
6.3.2 Off-Policy Value Approximation
6.3.2.1 Estimation of expected gradient in MC and TD
6.3.2.2 Variance Reduction in Off-policy TD
6.3.3 Deadly Triad Issue
6.4 Policy Approximation
6.4.1 Indirect On-Policy Gradient
6.4.1.1 Policy gradient with action-value function
6.4.1.2 Policy gradient with state-value function
6.4.2 Indirect Off-Policy Gradient
6.4.3 Revisit Better Policy for More Options
6.4.3.1 Penalize greedy search with policy entropy
6.4.3.2 Expectation-based definition for better policy
6.5 Actor-Critic Architecture from Indirect RL
6.5.1 Generic Actor-Critic Algorithms
6.5.2 On-policy AC with Action-Value Function
6.5.3 On-policy AC with State-Value Function
6.6 Example: Autonomous Car on Circular Road
6.6.1 Autonomous Vehicle Control Problem
6.6.2 Design of On-policy Actor-Critic Algorithms
6.6.3 Training Results and Performance Comparison
6.7 References
Chapter 7. Direct RL with Policy Gradient
7.1 Overall Objective Function for Direct RL
7.1.1 Stationary Properties of MDPs
7.1.2 Discounted Objective Function
7.1.3 Average Objective Function
7.2 Likelihood Ratio Gradient in General
7.2.1 Gradient Descent Algorithm
7.2.2 Method I: Gradient Derivation from Trajectory Concept
7.2.2.1 Simplification by temporal causality
7.2.2.2 Vanilla policy gradient with state-action format
7.2.3 Method II: Gradient Derivation from Cascading Concept
7.3 On-Policy Gradient
7.3.1 On-Policy Gradient Theorem
7.3.2 Extension of On-Policy Gradient
7.3.2.1 Monte Carlo policy gradient
7.3.2.2 Add baseline to reduce variance
7.3.2.3 Replace action-value with state-value
7.3.3 How Baseline Really Works
7.4 Off-Policy Gradient
7.4.1 Off-Policy True Gradient
7.4.2 Off-Policy Quasi-Gradient
7.5 Actor-Critic Architecture from Direct RL
7.5.1 Off-policy AC with Advantage Function
7.5.2 Off-policy Deterministic Actor-Critic
7.5.3 State-Of-The-Art AC Algorithms
7.6 Miscellaneous Topics in Direct RL
7.6.1 Tricks in Stochastic Derivative
7.6.1.1 Log-derivative trick
7.6.1.2 Reparameterization trick
7.6.2 Types of Numerical Optimization
7.6.2.1 Derivative-free optimization
7.6.2.2 First-order optimization
7.6.2.3 Second-order optimization
7.6.3 Some Variants of Objective Function
7.6.3.1 Surrogate function optimization
7.6.3.2 Entropy regularization
7.7 References
Chapter 8. Approximate Dynamic Programming
8.1 Discrete-time ADP with Infinite Horizon
8.1.1 Bellman Equation
8.1.1.1 Existence of Bellman equation and optimal policy
8.1.1.2 Why initial state is NOT considered
8.1.2 Discrete-time ADP Framework
8.1.3 Convergence and Stability
8.1.3.1 Closed-loop Stability
8.1.3.2 Convergence of ADP
8.1.4 Convergence in Inexact Policy Iteration
8.1.5 Discrete-time ADP with Parameterized Function
8.2 Continuous-time ADP with Infinite Horizon
8.2.1 Hamilton-Jacobi-Bellman (HJB) Equation
8.2.2 Continuous-time ADP Framework
8.2.3 Convergence and Stability
8.2.3.1 Closed-loop stability
8.2.3.2 Convergence of ADP
8.2.4 Continuous-time ADP with Parameterized Function
8.2.5 Continuous-time Linear Quadratic Control
8.3 How to Run RL and ADP Algorithms
8.3.1 Two Kinds of Environment Models
8.3.2 Implementation of RL/ADP Algorithms
8.3.2.1 State initialization sampling (SIS)
8.3.2.2 State rollout forward (SRF)
8.3.3 Training Efficiency Comparison: RL vs. ADP
8.4 ADP for Tracking Problem and its Policy Structure
8.4.1 Two Types of Time Domain in RHC
8.4.2 Definition of Tracking ADP in Real-time Domain
8.4.3 Reformulate Tracking ADP in Virtual-time Domain
8.4.4 Quantitative Analysis with Linear Quadratic Control
8.4.4.1 Case I: Recursive solutions for finite-horizon problems
8.4.4.2 Case II: Steady-state solutions for infinite-horizon problems
8.4.5 Sampling Mechanism Considering Reference Dynamics
8.5 Methods to Design Finite-Horizon ADPs
8.5.1 Basis of Finite-Horizon ADP Algorithm
8.5.2 Optimal Regulator
8.5.3 Optimal Tracker with Multistage Policy
8.5.4 Optimal Tracker with Recurrent Policy
8.6 Example: Lane Keeping Control on Curved Road
8.6.1 Lane Keeping Problem Description
8.6.2 Design of ADP Algorithm with Neural Network
8.6.3 Training Results of ADP and MPC
8.7 References
Chapter 9. State Constraints and Safety Consideration
9.1 Methods to Handle Input Constraint
9.1.1 Saturated Policy Function
9.1.2 Penalized Utility Function
9.2 Relation of State Constraint and Feasibility
9.2.1 Understanding State Constraint in Two Domains
9.2.2 Definitions of Infeasibility and Feasibility
9.2.3 Constraint Types and Feasibility Analysis
9.2.3.1 Type I: pointwise constraints
9.2.3.2 Type II: barrier constraints
9.2.4 Example of Emergency Braking Control
9.2.5 Classification of Constrained RL/ADP Methods
9.2.5.1 Direct RL/ADP for constrained OCPs
9.2.5.2 Indirect RL/ADP for constrained OCPs
9.3 Penalty Function Methods
9.3.1 Additive Smoothing Penalty
9.3.2 Absorbing State Penalty
9.4 Lagrange Multiplier Methods
9.4.1 Dual Ascent Method
9.4.2 Geometric Interpretation
9.4.3 Augmented Lagrange Multiplier
9.5 Feasible Descent Direction Methods
9.5.1 Feasible Direction and Descent Direction
9.5.1.1 Transformation into linear programming
9.5.1.2 Transformation into quadratic programming
9.5.2 Constrained LP Optimization
9.5.2.1 Lagrange-based LP Method
9.5.2.2 Gradient projection
9.5.3 Constrained QP Optimization
9.5.3.1 Active-set QP
9.5.3.2 Quasi-Newton method
9.6 Actor-Critic-Scenery (ACS) Architecture
9.6.1 Two Types of ACS Architectures
9.6.2 Reachability-based Scenery
9.6.2.1 Reachable feasibility function
9.6.2.2 Risky Bellman equation
9.6.2.3 Model-based reachable ACS algorithm
9.6.3 Solvability-based Scenery
9.6.3.1 Basis of solvability-based scenery
9.6.3.2 Solvable feasibility function
9.6.3.3 Monotonic extension of feasible region
9.6.3.4 Model-based solvable ACS algorithms
9.7 Safety Considerations in the Real World
9.7.1 Two Basic Modes to Train Safe Policy
9.7.2 Model-free ACS with Final Safe Guarantee
9.7.3 Mixed ACS with Safe Environment Interaction
9.7.3.1 Mechanism of safe environment interactions
9.7.3.2 Mixed ACS with regional weighted gradients
9.7.4 Safety Shield Mechanism in Implementation
9.8 References
Chapter 10. Deep Reinforcement Learning
10.1 Introduction to Artificial Neural Network
10.1.1 Neurons
10.1.2 Layers
10.1.2.1 Fully Connected Layer
10.1.2.2 Convolutional Layer
10.1.2.3 Recurrent Layer
10.1.3 Typical Neural Networks
10.2 Training of Artificial Neural Network
10.2.1 Loss Function
10.2.1.1 Cross Entropy
10.2.1.2 Mean Squared Error
10.2.1.3 Bias-Variance Trade-Off
10.2.2 Training Algorithms
10.2.2.1 Backpropagation
10.2.2.2 Backpropagation Through Time
10.3 Challenges of Deep Reinforcement Learning
10.3.1 Challenge: Non-iid Sequential Data
10.3.1.1 Trick: Experience Replay (ExR)
10.3.1.2 Trick: Parallel Exploration (PEx)
10.3.2 Challenge: Easy Divergence
10.3.2.1 Trick: Separated Target Network (STN)
10.3.2.2 Trick: Delayed Policy Update (DPU)
10.3.2.3 Trick: Constrained Policy Update (CPU)
10.3.2.4 Trick: Clipped Actor Criterion (CAC)
10.3.3 Challenge: Overestimation
10.3.3.1 Trick: Double Q-Functions (DQF)
10.3.3.2 Trick: Bounded Double Q-functions (BDQ)
10.3.3.3 Trick: Distributional Return Function (DRF)
10.3.4 Challenge: Sample Inefficiency
10.3.4.1 Trick: Entropy Regularization (EnR)
10.3.4.2 Trick: Soft Value Function (SVF)
10.4 Deep RL Algorithms
10.4.1 Deep Q-Network
10.4.2 Dueling DQN
10.4.3 Double DQN
10.4.4 Trust Region Policy Optimization
10.4.5 Proximal Policy Optimization
10.4.6 Asynchronous Advantage Actor-Critic
10.4.7 Deep Deterministic Policy Gradient
10.4.8 Twin Delayed DDPG
10.4.9 Soft Actor-Critic
10.4.10 Distributional SAC
10.5 References
Chapter 11. Miscellaneous Topics
11.1 Robust RL with Bounded Uncertainty
11.1.1 H-infinity Control and Zero-Sum Game
11.1.2 Linear Version of Robust RL
11.1.3 Nonlinear Version of Robust RL
11.2 Partially Observable MDP
11.2.1 Problem Description of POMDP
11.2.2 Belief State and Separation Principle
11.2.3 Linear Quadratic Gaussian Control
11.3 Meta Reinforcement Learning
11.3.1 Transferable Experience
11.3.1.1 RL2
11.3.1.2 SNAIL
11.3.2 Transferable Policy
11.3.2.1 Pretraining method
11.3.2.2 MAML
11.3.2.3 MAESN
11.3.3 Transferable Loss
11.3.3.1 EPG
11.4 Multi-agent Reinforcement Learning
11.4.1 Stochastic Multi-Agent Games
11.4.2 Fully Cooperative RL
11.4.3 Fully Competitive RL
11.4.4 Multi-agent RL with Hybrid Rewards
11.5 Inverse Reinforcement Learning
11.5.1 Essence of Inverse RL
11.5.2 Max-Margin Inverse RL
11.5.3 Max Entropy Inverse RL
11.6 Offline Reinforcement Learning
11.6.1 Policy constraint offline RL
11.6.2 Model-based Offline RL
11.6.3 Value Function Regularization
11.6.4 Uncertainty-based Offline RL
11.6.5 In-sample Offline RL
11.6.6 Goal-conditioned Imitation Learning
11.7 Major RL Frameworks/Libraries
11.7.1 RLlib
11.7.2 OpenAI Baselines
11.7.3 GOPS
11.7.4 Tianshou
11.8 Major RL Simulation Platforms
11.8.1 OpenAI Gym
11.8.2 TORCS
11.8.3 DeepMind Lab
11.8.4 StarCraft II Environment
11.8.5 Microsoft’s Project Malmo
11.8.6 Unity ML-Agents Toolkit
11.9 References
Index