Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to learn by reinforcement and enable a machine to learn by itself.
Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You'll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook; it doesn't shy away from math, and it expects familiarity with ML.
• Learn what RL is and how the algorithms help solve problems
• Become grounded in RL fundamentals including Markov decision processes, dynamic programming, and temporal difference learning
• Dive deep into a range of value and policy gradient methods
• Apply advanced RL solutions such as meta learning, hierarchical learning, multi-agent, and imitation learning
• Understand cutting-edge deep RL algorithms including Rainbow, PPO, TD3, SAC, and more
• Get practical examples through the accompanying website
Author(s): Phil Winder
Edition: 1
Publisher: O'Reilly Media
Year: 2020
Language: English
Commentary: Vector PDF
Pages: 408
City: Sebastopol, CA
Tags: Deep Learning; Reinforcement Learning; Dynamic Programming; Temporal Difference Learning; Entropy; Q-Learning; Markov Decision Process; Monte Carlo Simulations; n-Step Algorithms; Deep Q-Networks; Policy Gradient Methods; Hierarchical Reinforcement Learning; Multi-Agent Reinforcement Learning
Cover
Copyright
Table of Contents
Preface
Objective
Who Should Read This Book?
Guiding Principles and Style
Prerequisites
Scope and Outline
Supplementary Materials
Conventions Used in This Book
Acronyms
Mathematical Notation
Fair Use Policy
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Why Reinforcement Learning?
Why Now?
Machine Learning
Reinforcement Learning
When Should You Use RL?
RL Applications
Taxonomy of RL Approaches
Model-Free or Model-Based
How Agents Use and Update Their Strategy
Discrete or Continuous Actions
Optimization Methods
Policy Evaluation and Improvement
Fundamental Concepts in Reinforcement Learning
The First RL Algorithm
Is RL the Same as ML?
Reward and Feedback
Reinforcement Learning as a Discipline
Summary
Further Reading
Chapter 2. Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods
Multi-Arm Bandit Testing
Reward Engineering
Policy Evaluation: The Value Function
Policy Improvement: Choosing the Best Action
Simulating the Environment
Running the Experiment
Improving the ϵ-greedy Algorithm
Markov Decision Processes
Inventory Control
Inventory Control Simulation
Policies and Value Functions
Discounted Rewards
Predicting Rewards with the State-Value Function
Predicting Rewards with the Action-Value Function
Optimal Policies
Monte Carlo Policy Generation
Value Iteration with Dynamic Programming
Implementing Value Iteration
Results of Value Iteration
Summary
Further Reading
Chapter 3. Temporal-Difference Learning, Q-Learning, and n-Step Algorithms
Formulation of Temporal-Difference Learning
Q-Learning
SARSA
Q-Learning Versus SARSA
Case Study: Automatically Scaling Application Containers to Reduce Cost
Industrial Example: Real-Time Bidding in Advertising
Defining the MDP
Results of the Real-Time Bidding Environments
Further Improvements
Extensions to Q-Learning
Double Q-Learning
Delayed Q-Learning
Comparing Standard, Double, and Delayed Q-Learning
Opposition Learning
n-Step Algorithms
n-Step Algorithms on Grid Environments
Eligibility Traces
Extensions to Eligibility Traces
Watkins’s Q(λ)
Fuzzy Wipes in Watkins’s Q(λ)
Speedy Q-Learning
Accumulating Versus Replacing Eligibility Traces
Summary
Further Reading
Chapter 4. Deep Q-Networks
Deep Learning Architectures
Fundamentals
Common Neural Network Architectures
Deep Learning Frameworks
Deep Reinforcement Learning
Deep Q-Learning
Experience Replay
Q-Network Clones
Neural Network Architecture
Implementing DQN
Example: DQN on the CartPole Environment
Case Study: Reducing Energy Usage in Buildings
Rainbow DQN
Distributional RL
Prioritized Experience Replay
Noisy Nets
Dueling Networks
Example: Rainbow DQN on Atari Games
Results
Discussion
Other DQN Improvements
Improving Exploration
Improving Rewards
Learning from Offline Data
Summary
Further Reading
Chapter 5. Policy Gradient Methods
Benefits of Learning a Policy Directly
How to Calculate the Gradient of a Policy
Policy Gradient Theorem
Policy Functions
Linear Policies
Arbitrary Policies
Basic Implementations
Monte Carlo (REINFORCE)
REINFORCE with Baseline
Gradient Variance Reduction
n-Step Actor-Critic and Advantage Actor-Critic (A2C)
Eligibility Traces Actor-Critic
A Comparison of Basic Policy Gradient Algorithms
Industrial Example: Automatically Purchasing Products for Customers
The Environment: Gym-Shopping-Cart
Expectations
Results from the Shopping Cart Environment
Summary
Further Reading
Chapter 6. Beyond Policy Gradients
Off-Policy Algorithms
Importance Sampling
Behavior and Target Policies
Off-Policy Q-Learning
Gradient Temporal-Difference Learning
Greedy-GQ
Off-Policy Actor-Critics
Deterministic Policy Gradients
Deterministic Policy Gradients
Deep Deterministic Policy Gradients
Twin Delayed DDPG
Case Study: Recommendations Using Reviews
Improvements to DPG
Trust Region Methods
Kullback–Leibler Divergence
Natural Policy Gradients and Trust Region Policy Optimization
Proximal Policy Optimization
Example: Using Servos for a Real-Life Reacher
Experiment Setup
RL Algorithm Implementation
Increasing the Complexity of the Algorithm
Hyperparameter Tuning in a Simulation
Resulting Policies
Other Policy Gradient Algorithms
Retrace(λ)
Actor-Critic with Experience Replay (ACER)
Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR)
Emphatic Methods
Extensions to Policy Gradient Algorithms
Quantile Regression in Policy Gradient Algorithms
Summary
Which Algorithm Should I Use?
A Note on Asynchronous Methods
Further Reading
Chapter 7. Learning All Possible Policies with Entropy Methods
What Is Entropy?
Maximum Entropy Reinforcement Learning
Soft Actor-Critic
SAC Implementation Details and Discrete Action Spaces
Automatically Adjusting Temperature
Case Study: Automated Traffic Management to Reduce Queuing
Extensions to Maximum Entropy Methods
Other Measures of Entropy (and Ensembles)
Optimistic Exploration Using the Upper Bound of Double Q-Learning
Tinkering with Experience Replay
Soft Policy Gradient
Soft Q-Learning (and Derivatives)
Path Consistency Learning
Performance Comparison: SAC Versus PPO
How Does Entropy Encourage Exploration?
How Does the Temperature Parameter Alter Exploration?
Industrial Example: Learning to Drive with a Remote Control Car
Description of the Problem
Minimizing Training Time
Dramatic Actions
Hyperparameter Search
Final Policy
Further Improvements
Summary
Equivalence Between Policy Gradients and Soft Q-Learning
What Does This Mean For the Future?
What Does This Mean Now?
Chapter 8. Improving How an Agent Learns
Rethinking the MDP
Partially Observable Markov Decision Process
Case Study: Using POMDPs in Autonomous Vehicles
Contextual Markov Decision Processes
MDPs with Changing Actions
Regularized MDPs
Hierarchical Reinforcement Learning
Naive HRL
High-Low Hierarchies with Intrinsic Rewards (HIRO)
Learning Skills and Unsupervised RL
Using Skills in HRL
HRL Conclusions
Multi-Agent Reinforcement Learning
MARL Frameworks
Centralized or Decentralized
Single-Agent Algorithms
Case Study: Using Single-Agent Decentralized Learning in UAVs
Centralized Learning, Decentralized Execution
Decentralized Learning
Other Combinations
Challenges of MARL
MARL Conclusions
Expert Guidance
Behavior Cloning
Imitation RL
Inverse RL
Curriculum Learning
Other Paradigms
Meta-Learning
Transfer Learning
Summary
Further Reading
Chapter 9. Practical Reinforcement Learning
The RL Project Life Cycle
Life Cycle Definition
Problem Definition: What Is an RL Project?
RL Problems Are Sequential
RL Problems Are Strategic
Low-Level RL Indicators
Types of Learning
RL Engineering and Refinement
Process
Environment Engineering
State Engineering or State Representation Learning
Policy Engineering
Mapping Policies to Action Spaces
Exploration
Reward Engineering
Summary
Further Reading
Chapter 10. Operational Reinforcement Learning
Implementation
Frameworks
Scaling RL
Evaluation
Deployment
Goals
Architecture
Ancillary Tooling
Safety, Security, and Ethics
Summary
Further Reading
Chapter 11. Conclusions and the Future
Tips and Tricks
Framing the Problem
Your Data
Training
Evaluation
Deployment
Debugging
${ALGORITHM_NAME} Can’t Solve ${ENVIRONMENT}!
Monitoring for Debugging
The Future of Reinforcement Learning
RL Market Opportunities
Future RL and Research Directions
Concluding Remarks
Next Steps
Now It’s Your Turn
Further Reading
Appendix A. The Gradient of a Logistic Policy for Two Actions
Appendix B. The Gradient of a Softmax Policy
Glossary
Acronyms and Common Terms
Symbols and Notation
Index
About the Author
Contact Details
Colophon