Get hands-on experience in creating state-of-the-art reinforcement learning agents using TensorFlow and RLlib to solve complex real-world business and industry problems with the help of expert tips and best practices
Key Features
• Understand how large-scale state-of-the-art RL algorithms and approaches work
• Apply RL to solve complex problems in marketing, robotics, supply chain, finance, cybersecurity, and more
• Explore tips and best practices from experts that will enable you to overcome real-world RL challenges
Book Description
Reinforcement learning (RL) is a field of artificial intelligence (AI) used for creating self-learning autonomous agents. Building on a strong theoretical foundation, this book takes a practical approach and uses examples inspired by real-world industry problems to teach you about state-of-the-art RL.
Starting with bandit problems, Markov decision processes, and dynamic programming, the book provides an in-depth review of classical RL techniques, such as Monte Carlo methods and temporal-difference learning. After that, you'll learn about deep Q-learning, policy gradient algorithms, actor-critic methods, model-based methods, and multi-agent reinforcement learning. Then, you'll be introduced to some of the key approaches behind the most successful RL implementations, such as domain randomization and curiosity-driven learning.
As you advance, you'll explore many novel algorithms with advanced implementations using modern Python libraries such as TensorFlow and Ray's RLlib package. You'll also find out how to implement RL in areas such as robotics, supply chain management, marketing, finance, smart cities, and cybersecurity while assessing the trade-offs between different approaches and avoiding common pitfalls.
By the end of this book, you'll have mastered how to train and deploy your own RL agents to solve a wide range of real-world problems.
What you will learn
• Model and solve complex sequential decision-making problems using RL
• Develop a solid understanding of how state-of-the-art RL methods work
• Use Python and TensorFlow to code RL algorithms from scratch
• Parallelize and scale up your RL implementations using Ray's RLlib package
• Get in-depth knowledge of a wide variety of RL topics
• Understand the trade-offs between different RL approaches
• Discover and address the challenges of implementing RL in the real world
Who this book is for
This book is for expert machine learning practitioners and researchers looking to focus on hands-on reinforcement learning with Python by implementing advanced deep reinforcement learning concepts in real-world projects. Reinforcement learning experts who want to advance their knowledge to tackle large-scale and complex sequential decision-making problems will also find this book useful. Working knowledge of Python programming and deep learning, along with prior experience in reinforcement learning, is required.
About the Author
Enes Bilgin works as a senior AI engineer and a tech lead in Microsoft's Autonomous Systems division. He is a machine learning and operations research practitioner and researcher with experience in building production systems and models for top tech companies using Python, TensorFlow, and Ray/RLlib. He holds an M.S. and a Ph.D. in systems engineering from Boston University and a B.S. in industrial engineering from Bilkent University. He previously worked as a research scientist at Amazon and as an operations research scientist at AMD. He has also held adjunct faculty positions at the McCombs School of Business at the University of Texas at Austin and at the Ingram School of Engineering at Texas State University.
Author(s): Enes Bilgin
Edition: 1
Publisher: Packt Publishing
Year: 2020
Language: English
Commentary: Vector PDF
Pages: 532
City: Birmingham, UK
Tags: Machine Learning; Deep Learning; Reinforcement Learning; Cybersecurity; Marketing; Finance; Dynamic Programming; Distributed Processing; Temporal Difference Learning; Inventory Management; Smart Cities; Approximation Algorithms; Q-Learning; Markov Decision Process; Model-Based Reinforcement Learning; Actor-Critic Method; Monte Carlo Simulations; Deep Q-Networks; Policy Gradient Methods; Multi-Agent Reinforcement Learning; Machine Teaching; Domain Randomization; Meta-Reinforcement Learning; Auton
Cover
Title Page
Copyright and Credits
About Packt
Contributors
Table of Contents
Preface
Section 1: Reinforcement Learning Foundations
Chapter 1: Introduction to Reinforcement Learning
Why RL?
The three paradigms of machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
RL application areas and success stories
Games
Robotics and autonomous systems
Supply chain
Manufacturing
Personalization and recommender systems
Smart cities
Elements of an RL problem
RL concepts
Casting tic-tac-toe as an RL problem
Setting up your RL environment
Hardware requirements
Operating system
Software toolbox
Summary
References
Chapter 2: Multi-Armed Bandits
Exploration-exploitation trade-off
What is a MAB?
Problem definition
Experimenting with a simple MAB problem
Case study – online advertising
A/B/n testing
Notation
Application to the online advertising scenario
Advantages and disadvantages of A/B/n testing
ε-greedy actions
Application to the online advertising scenario
Advantages and disadvantages of ε-greedy actions
Action selection using upper confidence bounds
Application to the online advertising scenario
Advantages and disadvantages of using UCBs
Thompson (posterior) sampling
Application to the online advertising scenario
Advantages and disadvantages of Thompson sampling
Summary
References
Chapter 3: Contextual Bandits
Why we need function approximations
Using function approximations for context
Case study – contextual online advertising with synthetic user data
Function approximation with regularized logistic regression
Objective – regret minimization
Solving the online advertising problem
Using function approximations for actions
Case study – contextual online advertising with user data from the US Census
Function approximation using a neural network
Calculating the regret
Solving the online advertising problem
Other applications of multi-armed bandits and CBs
Recommender systems
Web page/app feature design
Healthcare
Dynamic pricing
Finance
Control systems tuning
Summary
References
Chapter 4: Makings of the Markov Decision Process
Starting with Markov chains
Stochastic processes with the Markov property
Classification of states in a Markov chain
Transient and steady-state behavior
Example – n-step behavior in the grid world
Example – a sample path in an ergodic Markov chain
Semi-Markov processes and continuous-time Markov chains
Introducing the reward – Markov reward process
Attaching rewards to the grid world example
Relationships between average rewards with different initializations
Return, discount, and state values
Analytically calculating the state values
Estimating the state values iteratively
Bringing the action in – MDP
Definition
Grid world as an MDP
State-value function
Action-value function
Optimal state-value and action-value functions
Bellman optimality
Partially observable MDPs
Summary
Exercises
Further reading
Chapter 5: Solving the Reinforcement Learning Problem
Exploring dynamic programming
Example use case – inventory replenishment of a food truck
Policy evaluation
Policy iteration
Value iteration
Drawbacks of dynamic programming
Training your agent with Monte Carlo methods
Monte Carlo prediction
Monte Carlo control
Temporal-difference learning
One-step TD learning – TD(0)
n-step TD learning
Understanding the importance of simulation in reinforcement learning
Summary
Exercises
References
Section 2: Deep Reinforcement Learning
Chapter 6: Deep Q-Learning at Scale
From tabular Q-learning to deep Q-learning
Neural-fitted Q-iteration
Online Q-learning
Deep Q-networks
Key concepts in DQNs
The DQN algorithm
Extensions to the DQN – Rainbow
The extensions
The performance of the integrated agent
How to choose which extensions to use – ablations to Rainbow
What happened to the deadly triad?
Distributed deep Q-learning
Components of a distributed deep Q-learning architecture
Gorila – general RL architecture
Ape-X – distributed prioritized experience replay
Implementing scalable deep Q-learning algorithms using Ray
A primer on Ray
Ray implementation of a DQN variant
Using RLlib for production-grade deep RL
Summary
References
Chapter 7: Policy-Based Methods
Why should we use policy-based methods?
A more principled approach
The ability to use continuous action spaces
The ability to learn truly random stochastic policies
The vanilla policy gradient
The objective in policy gradient methods
Figuring out the gradient
REINFORCE
The problem with REINFORCE and all policy gradient methods
Vanilla policy gradient using RLlib
Actor-critic methods
Further reducing the variance in policy-based methods
Advantage Actor-Critic – A2C
Asynchronous Advantage Actor-Critic – A3C
Generalized advantage estimators
Trust-region methods
Policy gradient as policy iteration
TRPO – Trust Region Policy Optimization
PPO – Proximal Policy Optimization
Off-policy methods
DDPG – Deep Deterministic Policy Gradient
TD3 – Twin Delayed Deep Deterministic Policy Gradient
SAC – Soft Actor-Critic
IMPALA – Importance Weighted Actor-Learner Architecture
A comparison of the policy-based methods in Lunar Lander
How to pick the right algorithm
Open source implementations of policy-gradient methods
Summary
References
Chapter 8: Model-Based Methods
Technical requirements
Introducing model-based methods
Planning through a model
Defining the optimal control problem
Random shooting
Cross-entropy method
Covariance matrix adaptation evolution strategy
Monte Carlo tree search
Learning a world model
Understanding what model means
Identifying when to learn a model
Introducing a general procedure to learn a model
Understanding and mitigating the impact of model uncertainty
Learning a model from complex observations
Unifying model-based and model-free approaches
Refresher on Q-learning
Dyna-style acceleration of model-free methods using world models
Summary
References
Chapter 9: Multi-Agent Reinforcement Learning
Introducing multi-agent reinforcement learning
Collaboration and competition between MARL agents
Exploring the challenges in multi-agent reinforcement learning
Non-stationarity
Scalability
Unclear reinforcement learning objective
Information sharing
Training policies in multi-agent settings
RLlib multi-agent environment
Competitive self-play
Training tic-tac-toe agents through self-play
Designing the multi-agent tic-tac-toe environment
Configuring the trainer
Observing the results
Summary
References
Section 3: Advanced Topics in RL
Chapter 10: Machine Teaching
Technical requirements
Introduction to MT
Understanding the need for MT
Exploring the elements of MT
Engineering the reward function
When to engineer the reward function
Reward shaping
Example – reward shaping for a mountain car
Challenges with engineering the reward function
Curriculum learning
Warm starts and demonstration learning
Action masking
Concept networks
Downsides and the promises of MT
Summary
References
Chapter 11: Generalization and Domain Randomization
An overview of generalization and partial observability
Generalization and overfitting in supervised learning
Generalization and overfitting in RL
The connection between generalization and partial observability
Overcoming partial observability with memory
Overcoming overfitting with randomization
Recipe for generalization
Domain randomization for generalization
Dimensions of randomization
Quantifying generalization
The effect of regularization and network architecture on the generalization of RL policies
Network randomization and feature matching
Curriculum learning for generalization
Sunblaze environment
Using memory to overcome partial observability
Stacking observations
Using RNNs
Transformer architecture
Summary
References
Chapter 12: Meta-Reinforcement Learning
Introduction to meta-RL
Learning to learn
Defining meta-RL
Relation to animal learning and the Harlow experiment
Relation to partial observability and domain randomization
Meta-RL with recurrent policies
Grid world example
RLlib implementation
Gradient-based meta-RL
RLlib implementation
Meta-RL as partially observed RL
Challenges in meta-RL
Summary
References
Chapter 13: Other Advanced Topics
Distributed reinforcement learning
Scalable, efficient deep reinforcement learning – SEED RL
Recurrent experience replay in distributed reinforcement learning
Experimenting with SEED RL and R2D2
Curiosity-driven reinforcement learning
Curiosity-driven learning for hard-exploration problems
Challenges in curiosity-driven reinforcement learning
Never Give Up
Agent57 improvements
Offline reinforcement learning
An overview of how offline reinforcement learning works
Why we need special algorithms for offline learning
Why offline reinforcement learning is crucial
Advantage-weighted actor-critic
Offline reinforcement learning benchmarks
Summary
References
Section 4: Applications of RL
Chapter 14: Autonomous Systems
Introducing PyBullet
Setting up PyBullet
Getting familiar with the KUKA environment
Grasping a rectangular block using a KUKA robot
The KUKA Gym environment
Developing strategies to solve the KUKA environment
Parametrizing the difficulty of the problem
Using curriculum learning to train the KUKA robot
Customizing the environment for curriculum learning
Designing the lessons in the curriculum
Training the agent using a manually designed curriculum
Curriculum learning using absolute learning progress
Comparing the experiment results
Going beyond PyBullet into autonomous driving
Summary
References
Chapter 15: Supply Chain Management
Optimizing inventory procurement decisions
The need for inventory and the trade-offs in its management
Components of an inventory optimization problem
Single-step inventory optimization – the newsvendor problem
Simulating multi-step inventory dynamics
Developing a near-optimal benchmark policy
A reinforcement learning solution for inventory management
Modeling routing problems
Pick-up and delivery of online meal orders
Pointer networks for dynamic combinatorial optimization
Summary
References
Chapter 16: Marketing, Personalization and Finance
Going beyond bandits for personalization
Shortcomings of bandit models
Deep RL for news recommendation
Developing effective marketing strategies using RL
Personalized marketing content
Marketing resource allocation for customer acquisition
Reducing the customer churn rate
Winning back lost customers
Applying RL in finance
Challenges with using RL in finance
Introducing TensorTrade
Developing equity trading strategies
Summary
References
Chapter 17: Smart City and Cybersecurity
Traffic light control to optimize vehicle flow
Introducing Flow
Creating an experiment in Flow
Modeling the traffic light control problem
Solving the traffic control problem using RLlib
Further reading
Providing an ancillary service to a power grid
Power grid operations and ancillary services
Describing the environment and the decision-making problem
RL model
Detecting cyberattacks in a smart grid
The problem of early detection of cyberattacks in a power grid
Partial observability of the grid state
Summary
References
Chapter 18: Challenges and Future Directions in Reinforcement Learning
What you have achieved with this book
Challenges and future directions
Sample efficiency
Need for high-fidelity and fast simulation models
High-dimensional action spaces
Reward function fidelity
Safety, behavior guarantees, and explainability
Reproducibility and sensitivity to hyperparameter choices
Robustness and adversarial agents
Suggestions for aspiring RL experts
Go deeper into the theory
Follow good practitioners and research labs
Learn from papers and their good explanations
Stay up to date with trends in other fields of deep learning
Read open source repos
Practice!
Final words
References
Other Books You May Enjoy
Index