Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions

Clearing the jungle of stochastic optimization

Sequential decision problems, which unfold as a sequence of “decision, information, decision, information, …,” are ubiquitous, spanning virtually every human activity: business, health (personal and public health, and medical decision making), energy, the sciences, every field of engineering, finance, and e-commerce. The diversity of applications has attracted the attention of at least 15 distinct fields of research, which use eight distinct notational systems and have produced a vast array of analytical tools. A byproduct is that powerful tools developed in one community may be unknown to the others.

Reinforcement Learning and Stochastic Optimization offers a single canonical framework that can model any sequential decision problem using five core components: state variables, decision variables, exogenous information variables, the transition function, and the objective function. The book highlights twelve types of uncertainty that can enter a model, and it organizes the diverse set of methods for making decisions, known as policies, into four fundamental classes that span every method suggested in the academic literature or used in practice.
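
To make the five components concrete, here is a minimal sketch, in Python, of how they fit together on a toy inventory problem. This illustration is ours, not code from the book: the names simulate and order_up_to, the demand distribution, and the prices are assumptions made purely for the example.

    import random
    from typing import Callable

    def simulate(policy: Callable[[dict], float], T: int = 52, seed: int = 0) -> float:
        """Run one sample path and return the cumulative contribution."""
        rng = random.Random(seed)
        state = {"inventory": 0.0}           # state variable S_0
        total = 0.0
        for t in range(T):
            x = policy(state)                # decision x_t = X^pi(S_t)
            demand = rng.uniform(0.0, 10.0)  # exogenous information W_{t+1}
            sales = min(state["inventory"] + x, demand)
            total += 5.0 * sales - 1.0 * x   # contribution C(S_t, x_t, W_{t+1})
            # transition function: S_{t+1} = S^M(S_t, x_t, W_{t+1})
            state = {"inventory": state["inventory"] + x - sales}
        # objective: maximize the expected sum of contributions over policies
        return total

    # An order-up-to rule with a tunable parameter theta (a hypothetical policy).
    order_up_to = lambda S, theta=8.0: max(0.0, theta - S["inventory"])
    print(simulate(order_up_to))

Tuning theta to maximize the expected total contribution is policy search, and the order-up-to rule itself is an example of a policy function approximation (PFA), one of the four policy classes developed in the book.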

Reinforcement Learning and Stochastic Optimization is the first book to provide a balanced treatment of the different methods for modeling and solving sequential decision problems, following the style used by most books on machine learning, optimization, and simulation. The presentation assumes readers have taken a course in probability and statistics and have an interest in modeling and applications; linear programming is used occasionally for specific problem classes. The book is written both for readers who are new to the field and for those with some background in optimization under uncertainty.

Throughout the book, readers will find references to over 100 different applications, spanning pure learning problems, dynamic resource allocation problems, general state-dependent problems, and hybrid learning/resource-allocation problems such as those that arose in the COVID pandemic. There are 370 exercises, organized into seven groups: review questions, modeling, computation, problem solving, theory, programming exercises, and a “diary problem” that the reader chooses at the beginning of the book and then uses as the basis for questions throughout the rest of the book.

Author(s): Warren B. Powell
Publisher: Wiley
Year: 2022

Language: English
Pages: 1136
City: Hoboken

Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions
Contents
Preface
Acknowledgments
Part I – Introduction
1 Sequential Decision Problems
1.1 The Audience
1.2 The Communities of Sequential Decision Problems
1.3 Our Universal Modeling Framework
1.4 Designing Policies for Sequential Decision Problems
1.4.1 Policy Search
1.4.2 Policies Based on Lookahead Approximations
1.4.3 Mixing and Matching
1.4.4 Optimality of the Four Classes
1.4.5 Pulling it All Together
1.5 Learning
1.6 Themes
1.6.1 Blending Learning and Optimization
1.6.2 Bridging Machine Learning to Sequential Decisions
1.6.3 From Deterministic to Stochastic Optimization
1.6.4 From Single to Multiple Agents
1.7 Our Modeling Approach
1.8 How to Read this Book
1.8.1 Organization of Topics
1.8.2 How to Read Each Chapter
1.8.3 Organization of Exercises
1.9 Bibliographic Notes
Exercises
Bibliography
2 Canonical Problems and Applications
2.1 Canonical Problems
2.1.1 Stochastic Search – Derivative-based and Derivative-free
2.1.1.1 Derivative-based Stochastic Search
2.1.1.2 Derivative-free Stochastic Search
2.1.2 Decision Trees
2.1.3 Markov Decision Processes
2.1.4 Optimal Control
2.1.5 Approximate Dynamic Programming
2.1.6 Reinforcement Learning
2.1.7 Optimal Stopping
2.1.8 Stochastic Programming
2.1.9 The Multiarmed Bandit Problem
2.1.10 Simulation Optimization
2.1.11 Active Learning
2.1.12 Chance-constrained Programming
2.1.13 Model Predictive Control
2.1.14 Robust Optimization
2.2 A Universal Modeling Framework for Sequential Decision Problems
2.2.1 Our Universal Model for Sequential Decision Problems
2.2.2 A Compact Modeling Presentation
2.2.3 MDP/RL vs. Optimal Control Modeling Frameworks
2.3 Applications
2.3.1 The Newsvendor Problems
2.3.1.1 Basic Newsvendor – Final Reward
2.3.1.2 Basic Newsvendor – Cumulative Reward
2.3.1.3 Contextual Newsvendor
2.3.1.4 Multidimensional Newsvendor Problems
2.3.2 Inventory/Storage Problems
2.3.2.1 Inventory Without Lags
2.3.2.2 Inventory Planning with Forecasts
2.3.2.3 Lagged Decisions
2.3.3 Shortest Path Problems
2.3.3.1 A Deterministic Shortest Path Problem
2.3.3.2 A Stochastic Shortest Path Problem
2.3.3.3 A Dynamic Shortest Path Problem
2.3.3.4 A Robust Shortest Path Problem
2.3.4 Some Fleet Management Problems
2.3.4.1 The Nomadic Trucker
2.3.4.2 From One Driver to a Fleet
2.3.5 Pricing
2.3.6 Medical Decision Making
2.3.7 Scientific Exploration
2.3.8 Machine Learning vs. Sequential Decision Problems
2.4 Bibliographic Notes
Exercises
Bibliography
3 Online Learning
3.1 Machine Learning for Sequential Decisions
3.1.1 Observations and Data in Stochastic Optimization
3.1.2 Indexing Input xn and Response yn+1
3.1.3 Functions We Are Learning
3.1.4 Sequential Learning: From Very Little Data to … More Data
3.1.5 Approximation Strategies
3.1.6 From Data Analytics to Decision Analytics
3.1.7 Batch vs. Online Learning
3.2 Adaptive Learning Using Exponential Smoothing
3.3 Lookup Tables with Frequentist Updating
3.4 Lookup Tables with Bayesian Updating
3.4.1 The Updating Equations for Independent Beliefs
3.4.2 Updating for Correlated Beliefs
3.4.3 Gaussian Process Regression
3.5 Computing Bias and Variance*
3.6 Lookup Tables and Aggregation*
3.6.1 Hierarchical Aggregation
3.6.2 Estimates of Different Levels of Aggregation
3.6.3 Combining Multiple Levels of Aggregation
3.7 Linear Parametric Models
3.7.1 Linear Regression Review
3.7.2 Sparse Additive Models and Lasso
3.8 Recursive Least Squares for Linear Models
3.8.1 Recursive Least Squares for Stationary Data
3.8.2 Recursive Least Squares for Nonstationary Data*
3.8.3 Recursive Estimation Using Multiple Observations*
3.9 Nonlinear Parametric Models
3.9.1 Maximum Likelihood Estimation
3.9.2 Sampled Belief Models
3.9.3 Neural Networks – Parametric*
3.9.4 Limitations of Neural Networks
3.10 Nonparametric Models*
3.10.1 K-Nearest Neighbor
3.10.2 Kernel Regression
3.10.3 Local Polynomial Regression
3.10.4 Deep Neural Networks
3.10.5 Support Vector Machines
3.10.6 Indexed Functions, Tree Structures, and Clustering
3.10.7 Comments on Nonparametric Models
3.11 Nonstationary Learning*
3.11.1 Nonstationary Learning I – Martingale Truth
3.11.2 Nonstationary Learning II – Transient Truth
3.11.3 Learning Processes
3.12 The Curse of Dimensionality
3.13 Designing Approximation Architectures in Adaptive Learning
3.14 Why Does It Work?**
3.14.1 Derivation of the Recursive Estimation Equations
3.14.2 The Sherman-Morrison Updating Formula
3.14.3 Correlations in Hierarchical Estimation
3.14.4 Proof of Proposition 3.14.1
3.15 Bibliographic Notes
Exercises
Bibliography
4 Introduction to Stochastic Search
4.1 Illustrations of the Basic Stochastic Optimization Problem
4.2 Deterministic Methods
4.2.1 A “Stochastic” Shortest Path Problem
4.2.2 A Newsvendor Problem with Known Distribution
4.2.3 Chance-Constrained Optimization
4.2.4 Optimal Control
4.2.5 Discrete Markov Decision Processes
4.2.6 Remarks
4.3 Sampled Models
4.3.1 Formulating a Sampled Model
4.3.1.1 A Sampled Stochastic Linear Program
4.3.1.2 Sampled Chance-Constrained Models
4.3.1.3 Sampled Parametric Models
4.3.2 Convergence
4.3.3 Creating a Sampled Model
4.3.4 Decomposition Strategies*
4.4 Adaptive Learning Algorithms
4.4.1 Modeling Adaptive Learning Problems
4.4.2 Online vs. Offline Applications
4.4.2.1 Machine Learning
4.4.2.2 Optimization
4.4.3 Objective Functions for Learning
4.4.4 Designing Policies
4.5 Closing Remarks
4.6 Bibliographic Notes
Exercises
Bibliography
Part II – Stochastic Search
5 Derivative-Based Stochastic Search
5.1 Some Sample Applications
5.2 Modeling Uncertainty
5.2.1 Training Uncertainty W1, …, WN
5.2.2 Model Uncertainty S0
5.2.3 Testing Uncertainty
5.2.4 Policy Evaluation
5.2.5 Closing Notes
5.3 Stochastic Gradient Methods
5.3.1 A Stochastic Gradient Algorithm
5.3.2 Introduction to Stepsizes
5.3.3 Evaluating a Stochastic Gradient Algorithm
5.3.4 A Note on Notation
5.4 Styles of Gradients
5.4.1 Gradient Smoothing
5.4.2 Second-Order Methods
5.4.3 Finite Differences
5.4.4 SPSA
5.4.5 Constrained Problems
5.5 Parameter Optimization for Neural Networks*
5.5.1 Computing the Gradient
5.5.2 The Stochastic Gradient Algorithm
5.6 Stochastic Gradient Algorithm as a Sequential Decision Problem
5.7 Empirical Issues
5.8 Transient Problems*
5.9 Theoretical Performance*
5.10 Why Does it Work?
5.10.1 Some Probabilistic Preliminaries
5.10.2 An Older Proof*
5.10.3 A More Modern Proof**
5.11 Bibliographic Notes
Exercises
Bibliography
6 Stepsize Policies
6.1 Deterministic Stepsize Policies
6.1.1 Properties for Convergence
6.1.2 A Collection of Deterministic Policies
6.1.2.1 Constant Stepsizes
6.1.2.2 Generalized Harmonic Stepsizes
6.1.2.3 Polynomial Learning Rates
6.1.2.4 McClain's Formula
6.1.2.5 Search-then-Converge Learning Policy
6.2 Adaptive Stepsize Policies
6.2.1 The Case for Adaptive Stepsizes
6.2.2 Convergence Conditions
6.2.3 A Collection of Stochastic Policies
6.2.3.1 Kesten’s Rule
6.2.3.2 Trigg’s Formula
6.2.3.3 Stochastic Gradient Adaptive Stepsize Rule
6.2.3.4 ADAM
6.2.3.5 AdaGrad
6.2.3.6 RMSProp
6.2.4 Experimental Notes
6.3 Optimal Stepsize Policies*
6.3.1 Optimal Stepsizes for Stationary Data
6.3.2 Optimal Stepsizes for Nonstationary Data – I
6.3.3 Optimal Stepsizes for Nonstationary Data – II
6.4 Optimal Stepsizes for Approximate Value Iteration*
6.5 Convergence
6.6 Guidelines for Choosing Stepsize Policies
6.7 Why Does it Work?*
6.7.1 Proof of BAKF Stepsize
6.8 Bibliographic Notes
Exercises
Bibliography
7 Derivative-Free Stochastic Search
7.1 Overview of Derivative-free Stochastic Search
7.1.1 Applications and Time Scales
7.1.2 The Communities of Derivative-free Stochastic Search
7.1.3 The Multiarmed Bandit Story
7.1.4 From Passive Learning to Active Learning to Bandit Problems
7.2 Modeling Derivative-free Stochastic Search
7.2.1 The Universal Model
7.2.2 Illustration: Optimizing a Manufacturing Process
7.2.3 Major Problem Classes
7.3 Designing Policies
7.4 Policy Function Approximations
7.5 Cost Function Approximations
7.6 VFA-based Policies
7.6.1 An Optimal Policy
7.6.2 Beta-Bernoulli Belief Model
7.6.3 Backward Approximate Dynamic Programming
7.6.4 Gittins Indices for Learning in Steady State*
7.7 Direct Lookahead Policies
7.7.1 When Do We Need Lookahead Policies?
7.7.2 Single Period Lookahead Policies
7.7.3 Restricted Multiperiod Lookahead
7.7.4 Multiperiod Deterministic Lookahead
7.7.5 Multiperiod Stochastic Lookahead Policies
7.7.6 Hybrid Direct Lookahead
7.8 The Knowledge Gradient (Continued)*
7.8.1 The Belief Model
7.8.2 The Knowledge Gradient for Maximizing Final Reward
7.8.3 The Knowledge Gradient for Maximizing Cumulative Reward
7.8.4 The Knowledge Gradient for Sampled Belief Model*
7.8.5 Knowledge Gradient for Correlated Beliefs
7.9 Learning in Batches
7.10 Simulation Optimization*
7.10.1 An Indifference Zone Algorithm
7.10.2 Optimal Computing Budget Allocation
7.11 Evaluating Policies
7.11.1 Alternative Performance Metrics*
7.11.2 Perspectives of Optimality*
7.12 Designing Policies
7.12.1 Characteristics of a Policy
7.12.2 The Effect of Scaling
7.12.3 Tuning
7.13 Extensions*
7.13.1 Learning in Nonstationary Settings
7.13.2 Strategies for Designing Time-dependent Policies
7.13.3 A Transient Learning Model
7.13.4 The Knowledge Gradient for Transient Problems
7.13.5 Learning with Large or Continuous Choice Sets
7.13.6 Learning with Exogenous State Information – the Contextual Bandit Problem
7.13.7 State-dependent vs. State-independent Problems
7.14 Bibliographic Notes
Exercises
Bibliography
Part III – State-dependent Problems
8 State-dependent Problems
8.1 Graph Problems
8.1.1 A Stochastic Shortest Path Problem
8.1.2 The Nomadic Trucker
8.1.3 The Transformer Replacement Problem
8.1.4 Asset Valuation
8.2 Inventory Problems
8.2.1 A Basic Inventory Problem
8.2.2 The Inventory Problem – II
8.2.3 The Lagged Asset Acquisition Problem
8.2.4 The Batch Replenishment Problem
8.3 Complex Resource Allocation Problems
8.3.1 The Dynamic Assignment Problem
8.3.2 The Blood Management Problem
8.4 State-dependent Learning Problems
8.4.1 Medical Decision Making
8.4.2 Laboratory Experimentation
8.4.3 Bidding for Ad-clicks
8.4.4 An Information-collecting Shortest Path Problem
8.5 A Sequence of Problem Classes
8.6 Bibliographic Notes
Exercises
Bibliography
9 Modeling Sequential Decision Problems
9.1 A Simple Modeling Illustration
9.2 Notational Style
9.3 Modeling Time
9.4 The States of Our System
9.4.1 Defining the State Variable
9.4.2 The Three States of Our System
9.4.3 Initial State S0 vs. Subsequent States St, t > 0
9.4.4 Lagged State Variables*
9.4.5 The Post-decision State Variable*
9.4.6 A Shortest Path Illustration
9.4.7 Belief States*
9.4.8 Latent Variables*
9.4.9 Rolling Forecasts*
9.4.10 Flat vs. Factored State Representations*
9.4.11 A Programmer’s Perspective of State Variables
9.5 Modeling Decisions
9.5.1 Types of Decisions
9.5.2 Initial Decision x0 vs. Subsequent Decisions xt, t > 0
9.5.3 Strategic, Tactical, and Execution Decisions
9.5.4 Constraints
9.5.5 Introducing Policies
9.6 The Exogenous Information Process
9.6.1 Basic Notation for Information Processes
9.6.2 Outcomes and Scenarios
9.6.3 Lagged Information Processes*
9.6.4 Models of Information Processes*
9.6.5 Supervisory Processes*
9.7 The Transition Function
9.7.1 A General Model
9.7.2 Model-free Dynamic Programming
9.7.3 Exogenous Transitions
9.8 The Objective Function
9.8.1 The Performance Metric
9.8.2 Optimizing the Policy
9.8.3 Dependence of Optimal Policy on S0
9.8.4 State-dependent Variations
9.8.5 Uncertainty Operators
9.9 Illustration: An Energy Storage Model
9.9.1 With a Time-series Price Model
9.9.2 With Passive Learning
9.9.3 With Active Learning
9.9.4 With Rolling Forecasts
9.10 Base Models and Lookahead Models
9.11 A Classification of Problems*
9.12 Policy Evaluation*
9.13 Advanced Probabilistic Modeling Concepts**
9.13.1 A Measure-theoretic View of Information**
9.13.2 Policies and Measurability
9.14 Looking Forward
9.15 Bibliographic Notes
Exercises
Bibliography
10 Uncertainty Modeling
10.1 Sources of Uncertainty
10.1.1 Observational Errors
10.1.2 Exogenous Uncertainty
10.1.3 Prognostic Uncertainty
10.1.4 Inferential (or Diagnostic) Uncertainty
10.1.5 Experimental Variability
10.1.6 Model Uncertainty
10.1.7 Transitional Uncertainty
10.1.8 Control/implementation Uncertainty
10.1.9 Communication Errors and Biases
10.1.10 Algorithmic Instability
10.1.11 Goal Uncertainty
10.1.12 Political/regulatory Uncertainty
10.1.13 Discussion
10.2 A Modeling Case Study: The COVID Pandemic
10.3 Stochastic Modeling
10.3.1 Sampling Exogenous Information
10.3.2 Types of Distributions
10.3.3 Modeling Sample Paths
10.3.4 State-action-dependent Processes
10.3.5 Modeling Correlations
10.4 Monte Carlo Simulation
10.4.1 Generating Uniform [0, 1] Random Variables
10.4.2 Uniform and Normal Random Variables
10.4.3 Generating Random Variables from Inverse Cumulative Distributions
10.4.4 Inverse Cumulative From Quantile Distributions
10.4.5 Distributions with Uncertain Parameters
10.5 Case Study: Modeling Electricity Prices
10.5.1 Mean Reversion
10.5.2 Jump-diffusion Models
10.5.3 Quantile Distributions
10.5.4 Regime Shifting
10.5.5 Crossing Times
10.6 Sampling vs. Sampled Models
10.6.1 Iterative Sampling: A Stochastic Gradient Algorithm
10.6.2 Static Sampling: Solving a Sampled Model
10.6.3 Sampled Representation with Bayesian Updating
10.7 Closing Notes
10.8 Bibliographic Notes
Exercises
Bibliography
11 Designing Policies
11.1 From Optimization to Machine Learning to Sequential Decision Problems
11.2 The Classes of Policies
11.3 Policy Function Approximations
11.4 Cost Function Approximations
11.5 Value Function Approximations
11.6 Direct Lookahead Approximations
11.6.1 The Basic Idea
11.6.2 Modeling the Lookahead Problem
11.6.3 The Policy-Within-a-Policy
11.7 Hybrid Strategies
11.7.1 Cost Function Approximation with Policy Function Approximations
11.7.2 Lookahead Policies with Value Function Approximations
11.7.3 Lookahead Policies with Cost Function Approximations
11.7.4 Tree Search with Rollout Heuristic and a Lookup Table Policy
11.7.5 Value Function Approximation with Policy Function Approximation
11.7.6 Fitting Value Functions Using ADP and Policy Search
11.8 Randomized Policies
11.9 Illustration: An Energy Storage Model Revisited
11.9.1 Policy Function Approximation
11.9.2 Cost Function Approximation
11.9.3 Value Function Approximation
11.9.4 Deterministic Lookahead
11.9.5 Hybrid Lookahead-Cost Function Approximation
11.9.6 Experimental Testing
11.10 Choosing the Policy Class
11.10.1 The Policy Classes
11.10.2 Policy Complexity-Computational Tradeoffs
11.10.3 Screening Questions
11.11 Policy Evaluation
11.12 Parameter Tuning
11.12.1 The Soft Issues
11.12.2 Searching Across Policy Classes
11.13 Bibliographic Notes
Exercises
Bibliography
Part IV – Policy Search
12 Policy Function Approximations and Policy Search
12.1 Policy Search as a Sequential Decision Problem
12.2 Classes of Policy Function Approximations
12.2.1 Lookup Table Policies
12.2.2 Boltzmann Policies for Discrete Actions
12.2.3 Linear Decision Rules
12.2.4 Monotone Policies
12.2.5 Nonlinear Policies
12.2.6 Nonparametric/Locally Linear Policies
12.2.7 Contextual Policies
12.3 Problem Characteristics
12.4 Flavors of Policy Search
12.5 Policy Search with Numerical Derivatives
12.6 Derivative-Free Methods for Policy Search
12.6.1 Belief Models
12.6.2 Learning Through Perturbed PFAs
12.6.3 Learning CFAs
12.6.4 DLA Using the Knowledge Gradient
12.6.5 Comments
12.7 Exact Derivatives for Continuous Sequential Problems*
12.8 Exact Derivatives for Discrete Dynamic Programs**
12.8.1 A Stochastic Policy
12.8.2 The Objective Function
12.8.3 The Policy Gradient Theorem
12.8.4 Computing the Policy Gradient
12.9 Supervised Learning
12.10 Why Does it Work?
12.10.1 Derivation of the Policy Gradient Theorem
12.11 Bibliographic Notes
Exercises
Bibliography
13 Cost Function Approximations
13.1 General Formulation for Parametric CFA
13.2 Objective-Modified CFAs
13.2.1 Linear Cost Function Correction
13.2.2 CFAs for Dynamic Assignment Problems
13.2.3 Dynamic Shortest Paths
13.2.4 Dynamic Trading Policy
13.2.5 Discussion
13.3 Constraint-Modified CFAs
13.3.1 General Formulation of Constraint-Modified CFAs
13.3.2 A Blood Management Problem
13.3.3 An Energy Storage Example with Rolling Forecasts
13.4 Bibliographic Notes
Exercises
Bibliography
Part V – Lookahead Policies
14 Exact Dynamic Programming
14.1 Discrete Dynamic Programming
14.2 The Optimality Equations
14.2.1 Bellman's Equations
14.2.2 Computing the Transition Matrix
14.2.3 Random Contributions
14.2.4 Bellman’s Equation Using Operator Notation*
14.3 Finite Horizon Problems
14.4 Continuous Problems with Exact Solutions
14.4.1 The Gambling Problem
14.4.2 The Continuous Budgeting Problem
14.5 Infinite Horizon Problems*
14.6 Value Iteration for Infinite Horizon Problems*
14.6.1 A Gauss-Seidel Variation
14.6.2 Relative Value Iteration
14.6.3 Bounds and Rates of Convergence
14.7 Policy Iteration for Infinite Horizon Problems*
14.8 Hybrid Value-Policy Iteration*
14.9 Average Reward Dynamic Programming*
14.10 The Linear Programming Method for Dynamic Programs**
14.11 Linear Quadratic Regulation
14.12 Why Does it Work?**
14.12.1 The Optimality Equations
14.12.2 Convergence of Value Iteration
14.12.3 Monotonicity of Value Iteration
14.12.4 Bounding the Error from Value Iteration
14.12.5 Randomized Policies
14.13 Bibliographic Notes
Exercises
Bibliography
15 Backward Approximate Dynamic Programming
15.1 Backward Approximate Dynamic Programming for Finite Horizon Problems
15.1.1 Some Preliminaries
15.1.2 Backward ADP Using Lookup Tables
15.1.3 Backward ADP Algorithm with Continuous Approximations
15.2 Fitted Value Iteration for Infinite Horizon Problems
15.3 Value Function Approximation Strategies
15.3.1 Linear Models
15.3.2 Monotone Functions
15.3.3 Other Approximation Models
15.4 Computational Observations
15.4.1 Experimental Benchmarking of Backward ADP
15.4.2 Computational Notes
15.5 Bibliographic Notes
Exercises
Bibliography
16 Forward ADP I: The Value of a Policy
16.1 Sampling the Value of a Policy
16.1.1 Direct Policy Evaluation for Finite Horizon Problems
16.1.2 Policy Evaluation for Infinite Horizon Problems
16.1.3 Temporal Difference Updates
16.1.4 TD(λ)
16.1.5 TD(0) and Approximate Value Iteration
16.1.6 TD Learning for Infinite Horizon Problems
16.2 Stochastic Approximation Methods
16.3 Bellman's Equation Using a Linear Model*
16.3.1 A Matrix-based Derivation**
16.3.2 A Simulation-based Implementation
16.3.3 Least Squares Temporal Difference Learning (LSTD)
16.3.4 Least Squares Policy Evaluation
16.4 Analysis of TD(0), LSTD, and LSPE Using a Single State*
16.4.1 Recursive Least Squares and TD(0)
16.4.2 Least Squares Policy Evaluation
16.4.3 Least Squares Temporal Difference Learning
16.4.4 Discussion
16.5 Gradient-based Methods for Approximate Value Iteration*
16.5.1 Approximate Value Iteration with Linear Models**
16.5.2 A Geometric View of Linear Models*
16.6 Value Function Approximations Based on Bayesian Learning*
16.6.1 Minimizing Bias for Infinite Horizon Problems
16.6.2 Lookup Tables with Correlated Beliefs
16.6.3 Parametric Models
16.6.4 Creating the Prior
16.7 Learning Algorithms and Stepsizes
16.7.1 Least Squares Temporal Differences
16.7.2 Least Squares Policy Evaluation
16.7.3 Recursive Least Squares
16.7.4 Bounding 1/n Convergence for Approximate Value Iteration
16.7.5 Discussion
16.8 Bibliographic Notes
Exercises
Bibliography
17 Forward ADP II: Policy Optimization
17.1 Overview of Algorithmic Strategies
17.2 Approximate Value Iteration and Q-Learning Using Lookup Tables
17.2.1 Value Iteration Using a Pre-Decision State Variable
17.2.2 Q-Learning
17.2.3 Value Iteration Using a Post-Decision State Variable
17.2.4 Value Iteration Using a Backward Pass
17.3 Styles of Learning
17.3.1 Offline Learning
17.3.2 From Offline to Online
17.3.3 Evaluating Offline and Online Learning Policies
17.3.4 Lookahead Policies
17.4 Approximate Value Iteration Using Linear Models
17.5 On-policy vs. Off-policy Learning and the Exploration–Exploitation Problem
17.5.1 Terminology
17.5.2 Learning with Lookup Tables
17.5.3 Learning with Generalized Belief Models
17.6 Applications
17.6.1 Pricing an American Option
17.6.2 Playing “Lose Tic-Tac-Toe”
17.6.3 Approximate Dynamic Programming for Deterministic Problems
17.7 Approximate Policy Iteration
17.7.1 Finite Horizon Problems Using Lookup Tables
17.7.2 Finite Horizon Problems Using Linear Models
17.7.3 LSTD for Infinite Horizon Problems Using Linear Models
17.8 The Actor–Critic Paradigm
17.9 Statistical Bias in the Max Operator*
17.10 The Linear Programming Method Using Linear Models*
17.11 Finite Horizon Approximations for Steady-State Applications
17.12 Bibliographic Notes
Exercises
Bibliography
18 Forward ADP III: Convex Resource Allocation Problems
18.1 Resource Allocation Problems
18.1.1 The Newsvendor Problem
18.1.2 Two-Stage Resource Allocation Problems
18.1.3 A General Multiperiod Resource Allocation Model*
18.2 Values Versus Marginal Values
18.3 Piecewise Linear Approximations for Scalar Functions
18.3.1 The Leveling Algorithm
18.3.2 The CAVE Algorithm
18.4 Regression Methods
18.5 Separable Piecewise Linear Approximations
18.6 Benders Decomposition for Nonseparable Approximations**
18.6.1 Benders' Decomposition for Two-Stage Problems
18.6.2 Asymptotic Analysis of Benders with Regularization**
18.6.3 Benders with Regularization
18.7 Linear Approximations for High-Dimensional Applications
18.8 Resource Allocation with Exogenous Information State
18.9 Closing Notes
18.10 Bibliographic Notes
Exercises
Bibliography
19 Direct Lookahead Policies
19.1 Optimal Policies Using Lookahead Models
19.2 Creating an Approximate Lookahead Model
19.2.1 Modeling the Lookahead Model
19.2.2 Strategies for Approximating the Lookahead Model
19.3 Modified Objectives in Lookahead Models
19.3.1 Managing Risk
19.3.2 Utility Functions for Multiobjective Problems
19.3.3 Model Discounting
19.4 Evaluating DLA Policies
19.4.1 Evaluating Policies in a Simulator
19.4.2 Evaluating Risk-Adjusted Policies
19.4.3 Evaluating Policies in the Field
19.4.4 Tuning Direct Lookahead Policies
19.5 Why Use a DLA?
19.6 Deterministic Lookaheads
19.6.1 A Deterministic Lookahead: Shortest Path Problems
19.6.2 Parameterized Lookaheads
19.7 A Tour of Stochastic Lookahead Policies
19.7.1 Lookahead PFAs
19.7.2 Lookahead CFAs
19.7.3 Lookahead VFAs for the Lookahead Model
19.7.4 Lookahead DLAs for the Lookahead Model
19.7.5 Discussion
19.8 Monte Carlo Tree Search for Discrete Decisions
19.8.1 Basic Idea
19.8.2 The Steps of MCTS
19.8.3 Discussion
19.8.4 Optimistic Monte Carlo Tree Search
19.9 Two-Stage Stochastic Programming for Vector Decisions*
19.9.1 The Basic Two-Stage Stochastic Program
19.9.2 Two-Stage Approximation of a Sequential Problem
19.9.3 Discussion
19.10 Observations on DLA Policies
19.11 Bibliographic Notes
Exercises
Bibliography
Part VI – Multiagent Systems
20 Multiagent Modeling and Learning
20.1 Overview of Multiagent Systems
20.1.1 Dimensions of a Multiagent System
20.1.2 Communication
20.1.3 Modeling a Multiagent System
20.1.4 Controlling Architectures
20.2 A Learning Problem – Flu Mitigation
20.2.1 Model 1: A Static Model
20.2.2 Variations of Our Flu Model
20.2.3 Two-Agent Learning Models
20.2.4 Transition Functions for Two-Agent Model
20.2.5 Designing Policies for the Flu Problem
20.3 The POMDP Perspective*
20.4 The Two-Agent Newsvendor Problem
20.5 Multiple Independent Agents – An HVAC Controller Model
20.5.1 Model
20.5.2 Designing Policies
20.6 Cooperative Agents – A Spatially Distributed Blood Management Problem
20.7 Closing Notes
20.8 Why Does it Work?
20.8.1 Derivation of the POMDP Belief Transition Function
20.9 Bibliographic Notes
Exercises
Bibliography
Index
EULA