Foundations of Reinforcement Learning with Applications in Finance

Foundations of Reinforcement Learning with Applications in Finance aims to demystify Reinforcement Learning, and to make it a practically useful tool for those studying and working in applied areas ― especially finance.

Reinforcement Learning is emerging as a powerful technique for solving a variety of complex problems across industries that involve Sequential Optimal Decisioning under Uncertainty. Its penetration into high-profile problems like self-driving cars, robotics, and strategy games points to a future where Reinforcement Learning algorithms will have decisioning abilities far superior to those of humans. But when it comes to getting educated in this area, there seems to be a reluctance to jump right in, because Reinforcement Learning appears to have acquired a reputation for being mysterious and technically challenging.

This book strives to impart a lucid and insightful understanding of the topic by emphasizing the foundational mathematics and implementing models and algorithms in well-designed Python code, along with robust coverage of several financial trading problems that can be solved with Reinforcement Learning. It is the product of years of iterative experimentation with the pedagogy of these topics, refined through teaching them to university students as well as industry practitioners.

Features

    • Focus on the foundational theory underpinning Reinforcement Learning and on the software design of the corresponding models and algorithms
    • Suitable as a primary text for courses in Reinforcement Learning, but also as supplementary reading for applied/financial mathematics, programming, and other related courses
    • Suitable for a professional audience of quantitative analysts or data scientists
    • Blends theory/mathematics, programming/algorithms, and real-world financial nuances while always striving to maintain simplicity and to build intuitive understanding

    Author(s): Ashwin Rao, Tikhon Jelvis
    Publisher: CRC Press/Chapman & Hall
    Year: 2022

    Language: English
    Pages: 520
    City: Boca Raton

    Cover
    Half Title
    Title Page
    Copyright Page
    Contents
    Preface
    Author Biographies
    Summary of Notation
    CHAPTER 1: Overview
    1.1. LEARNING REINFORCEMENT LEARNING
    1.2. WHAT YOU WILL LEARN FROM THIS BOOK
    1.3. EXPECTED BACKGROUND TO READ THIS BOOK
    1.4. DECLUTTERING THE JARGON LINKED TO REINFORCEMENT LEARNING
    1.5. INTRODUCTION TO THE MARKOV DECISION PROCESS (MDP) FRAMEWORK
    1.6. REAL-WORLD PROBLEMS THAT FIT THE MDP FRAMEWORK
    1.7. THE INHERENT DIFFICULTY IN SOLVING MDPS
    1.8. VALUE FUNCTION, BELLMAN EQUATIONS, DYNAMIC PROGRAMMING AND RL
    1.9. OUTLINE OF CHAPTERS
    1.9.1. Module I: Processes and Planning Algorithms
    1.9.2. Module II: Modeling Financial Applications
    1.9.3. Module III: Reinforcement Learning Algorithms
    1.9.4. Module IV: Finishing Touches
    1.9.5. Short Appendix Chapters
    CHAPTER 2: Programming and Design
    2.1. CODE DESIGN
    2.2. ENVIRONMENT SETUP
    2.3. CLASSES AND INTERFACES
    2.3.1. A Distribution Interface
    2.3.2. A Concrete Distribution
    2.3.2.1. Dataclasses
    2.3.2.2. Immutability
    2.3.3. Checking Types
    2.3.3.1. Static Typing
    2.3.4. Type Variables
    2.3.5. Functionality
    2.4. ABSTRACTING OVER COMPUTATION
    2.4.1. First-Class Functions
    2.4.1.1. Lambdas
    2.4.2. Iterative Algorithms
    2.4.2.1. Iterators and Generators
    2.5. KEY TAKEAWAYS FROM THIS CHAPTER
    MODULE I: Processes and Planning Algorithms
    Chapter 3: Markov Processes
    3.1. THE CONCEPT OF STATE IN A PROCESS
    3.2. UNDERSTANDING MARKOV PROPERTY FROM STOCK PRICE EXAMPLES
    3.3. FORMAL DEFINITIONS FOR MARKOV PROCESSES
    3.3.1. Starting States
    3.3.2. Terminal States
    3.3.3. Markov Process Implementation
    3.4. STOCK PRICE EXAMPLES MODELED AS MARKOV PROCESSES
    3.5. FINITE MARKOV PROCESSES
    3.6. SIMPLE INVENTORY EXAMPLE
    3.7. STATIONARY DISTRIBUTION OF A MARKOV PROCESS
    3.8. FORMALISM OF MARKOV REWARD PROCESSES
    3.9. SIMPLE INVENTORY EXAMPLE AS A MARKOV REWARD PROCESS
    3.10. FINITE MARKOV REWARD PROCESSES
    3.11. SIMPLE INVENTORY EXAMPLE AS A FINITE MARKOV REWARD PROCESS
    3.12. VALUE FUNCTION OF A MARKOV REWARD PROCESS
    3.13. SUMMARY OF KEY LEARNINGS FROM THIS CHAPTER
    Chapter 4: Markov Decision Processes
    4.1. SIMPLE INVENTORY EXAMPLE: HOW MUCH TO ORDER?
    4.2. THE DIFFICULTY OF SEQUENTIAL DECISIONING UNDER UNCERTAINTY
    4.3. FORMAL DEFINITION OF A MARKOV DECISION PROCESS
    4.4. POLICY
    4.5. [MARKOV DECISION PROCESS, POLICY] := MARKOV REWARD PROCESS
    4.6. SIMPLE INVENTORY EXAMPLE WITH UNLIMITED CAPACITY (INFINITE STATE/ACTION SPACE)
    4.7. FINITE MARKOV DECISION PROCESSES
    4.8. SIMPLE INVENTORY EXAMPLE AS A FINITE MARKOV DECISION PROCESS
    4.9. MDP VALUE FUNCTION FOR A FIXED POLICY
    4.10. OPTIMAL VALUE FUNCTION AND OPTIMAL POLICIES
    4.11. VARIANTS AND EXTENSIONS OF MDPS
    4.11.1. Size of Spaces and Discrete versus Continuous
    4.11.1.1. State Space
    4.11.1.2. Action Space
    4.11.1.3. Time Steps
    4.11.2. Partially-Observable Markov Decision Processes (POMDPs)
    4.12. SUMMARY OF KEY LEARNINGS FROM THIS CHAPTER
    Chapter 5: Dynamic Programming Algorithms
    5.1. PLANNING VERSUS LEARNING
    5.2. USAGE OF THE TERM DYNAMIC PROGRAMMING
    5.3. FIXED-POINT THEORY
    5.4. BELLMAN POLICY OPERATOR AND POLICY EVALUATION ALGORITHM
    5.5. GREEDY POLICY
    5.6. POLICY IMPROVEMENT
    5.7. POLICY ITERATION ALGORITHM
    5.8. BELLMAN OPTIMALITY OPERATOR AND VALUE ITERATION ALGORITHM
    5.9. OPTIMAL POLICY FROM OPTIMAL VALUE FUNCTION
    5.10. REVISITING THE SIMPLE INVENTORY EXAMPLE
    5.11. GENERALIZED POLICY ITERATION
    5.12. ASYNCHRONOUS DYNAMIC PROGRAMMING
    5.13. FINITE-HORIZON DYNAMIC PROGRAMMING: BACKWARD INDUCTION
    5.14. DYNAMIC PRICING FOR END-OF-LIFE/END-OF-SEASON OF A PRODUCT
    5.15. GENERALIZATION TO NON-TABULAR ALGORITHMS
    5.16. SUMMARY OF KEY LEARNINGS FROM THIS CHAPTER
    Chapter 6: Function Approximation and Approximate Dynamic Programming
    6.1. FUNCTION APPROXIMATION
    6.2. LINEAR FUNCTION APPROXIMATION
    6.3. NEURAL NETWORK FUNCTION APPROXIMATION
    6.4. TABULAR AS A FORM OF FunctionApprox
    6.5. APPROXIMATE POLICY EVALUATION
    6.6. APPROXIMATE VALUE ITERATION
    6.7. FINITE-HORIZON APPROXIMATE POLICY EVALUATION
    6.8. FINITE-HORIZON APPROXIMATE VALUE ITERATION
    6.9. FINITE-HORIZON APPROXIMATE Q-VALUE ITERATION
    6.10. HOW TO CONSTRUCT THE NON-TERMINAL STATES DISTRIBUTION
    6.11. KEY TAKEAWAYS FROM THIS CHAPTER
    MODULE II: Modeling Financial Applications
    Chapter 7: Utility Theory
    7.1. INTRODUCTION TO THE CONCEPT OF UTILITY
    7.2. A SIMPLE FINANCIAL EXAMPLE
    7.3. THE SHAPE OF THE UTILITY FUNCTION
    7.4. CALCULATING THE RISK-PREMIUM
    7.5. CONSTANT ABSOLUTE RISK-AVERSION (CARA)
    7.6. A PORTFOLIO APPLICATION OF CARA
    7.7. CONSTANT RELATIVE RISK-AVERSION (CRRA)
    7.8. A PORTFOLIO APPLICATION OF CRRA
    7.9. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 8: Dynamic Asset-Allocation and Consumption
    8.1. OPTIMIZATION OF PERSONAL FINANCE
    8.2. MERTON’S PORTFOLIO PROBLEM AND SOLUTION
    8.3. DEVELOPING INTUITION FOR THE SOLUTION TO MERTON’S PORTFOLIO PROBLEM
    8.4. A DISCRETE-TIME ASSET-ALLOCATION EXAMPLE
    8.5. PORTING TO REAL-WORLD
    8.6. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 9: Derivatives Pricing and Hedging
    9.1. A BRIEF INTRODUCTION TO DERIVATIVES
    9.1.1. Forwards
    9.1.2. European Options
    9.1.3. American Options
    9.2. NOTATION FOR THE SINGLE-PERIOD SIMPLE SETTING
    9.3. PORTFOLIOS, ARBITRAGE AND RISK-NEUTRAL PROBABILITY MEASURE
    9.4. FIRST FUNDAMENTAL THEOREM OF ASSET PRICING (1ST FTAP)
    9.5. SECOND FUNDAMENTAL THEOREM OF ASSET PRICING (2ND FTAP)
    9.6. DERIVATIVES PRICING IN SINGLE-PERIOD SETTING
    9.6.1. Derivatives Pricing When Market Is Complete
    9.6.2. Derivatives Pricing When Market Is Incomplete
    9.6.3. Derivatives Pricing When Market Has Arbitrage
    9.7. DERIVATIVES PRICING IN MULTI-PERIOD/CONTINUOUS-TIME
    9.7.1. Multi-Period Complete-Market Setting
    9.7.2. Continuous-Time Complete-Market Setting
    9.8. OPTIMAL EXERCISE OF AMERICAN OPTIONS CAST AS A FINITE MDP
    9.9. GENERALIZING TO OPTIMAL-STOPPING PROBLEMS
    9.10. PRICING/HEDGING IN AN INCOMPLETE MARKET CAST AS AN MDP
    9.11. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 10: Order-Book Trading Algorithms
    10.1. BASICS OF ORDER BOOK AND PRICE IMPACT
    10.2. OPTIMAL EXECUTION OF A MARKET ORDER
    10.2.1. Simple Linear Price Impact Model with No Risk-Aversion
    10.2.2. Paper by Bertsimas and Lo on Optimal Order Execution
    10.2.3. Incorporating Risk-Aversion and Real-World Considerations
    10.3. OPTIMAL MARKET-MAKING
    10.3.1. Avellaneda-Stoikov Continuous-Time Formulation
    10.3.2. Solving the Avellaneda-Stoikov Formulation
    10.3.3. Analytical Approximation to the Solution to Avellaneda-Stoikov Formulation
    10.3.4. Real-World Market-Making
    10.4. KEY TAKEAWAYS FROM THIS CHAPTER
    MODULE III: Reinforcement Learning Algorithms
    Chapter 11: Monte-Carlo and Temporal-Difference for Prediction
    11.1. OVERVIEW OF THE REINFORCEMENT LEARNING APPROACH
    11.2. RL FOR PREDICTION
    11.3. MONTE-CARLO (MC) PREDICTION
    11.4. TEMPORAL-DIFFERENCE (TD) PREDICTION
    11.5. TD VERSUS MC
    11.5.1. TD Learning Akin to Human Learning
    11.5.2. Bias, Variance and Convergence
    11.5.3. Fixed-Data Experience Replay on TD versus MC
    11.5.4. Bootstrapping and Experiencing
    11.6. TD(λ) PREDICTION
    11.6.1. n-Step Bootstrapping Prediction Algorithm
    11.6.2. λ-Return Prediction Algorithm
    11.6.3. Eligibility Traces
    11.6.4. Implementation of the TD(λ) Prediction Algorithm
    11.7. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 12: Monte-Carlo and Temporal-Difference for Control
    12.1. REFRESHER ON GENERALIZED POLICY ITERATION (GPI)
    12.2. GPI WITH EVALUATION AS MONTE-CARLO
    12.3. GLIE MONTE-CARLO CONTROL
    12.4. SARSA
    12.5. SARSA(λ)
    12.6. OFF-POLICY CONTROL
    12.6.1. Q-Learning
    12.6.2. Windy Grid
    12.6.3. Importance Sampling
    12.7. CONCEPTUAL LINKAGE BETWEEN DP AND TD ALGORITHMS
    12.8. CONVERGENCE OF RL ALGORITHMS
    12.9. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 13: Batch RL, Experience-Replay, DQN, LSPI, Gradient TD
    13.1. BATCH RL AND EXPERIENCE-REPLAY
    13.2. A GENERIC IMPLEMENTATION OF EXPERIENCE-REPLAY
    13.3. LEAST-SQUARES RL PREDICTION
    13.3.1. Least-Squares Monte-Carlo (LSMC)
    13.3.2. Least-Squares Temporal-Difference (LSTD)
    13.3.3. LSTD(λ)
    13.3.4. Convergence of Least-Squares Prediction
    13.4. Q-LEARNING WITH EXPERIENCE-REPLAY
    13.4.1. Deep Q-Networks (DQN) Algorithm
    13.5. LEAST-SQUARES POLICY ITERATION (LSPI)
    13.5.1. Saving Your Village from a Vampire
    13.5.2. Least-Squares Control Convergence
    13.6. RL FOR OPTIMAL EXERCISE OF AMERICAN OPTIONS
    13.6.1. LSPI for American Options Pricing
    13.6.2. Deep Q-Learning for American Options Pricing
    13.7. VALUE FUNCTION GEOMETRY
    13.7.1. Notation and Definitions
    13.7.2. Bellman Policy Operator and Projection Operator
    13.7.3. Vectors of Interest in the Φ Subspace
    13.8. GRADIENT TEMPORAL-DIFFERENCE (GRADIENT TD)
    13.9. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 14: Policy Gradient Algorithms
    14.1. ADVANTAGES AND DISADVANTAGES OF POLICY GRADIENT ALGORITHMS
    14.2. POLICY GRADIENT THEOREM
    14.2.1. Notation and Definitions
    14.2.2. Statement of the Policy Gradient Theorem
    14.2.3. Proof of the Policy Gradient Theorem
    14.3. SCORE FUNCTION FOR CANONICAL POLICY FUNCTIONS
    14.3.1. Canonical π(s, a; θ) for Finite Action Spaces
    14.3.2. Canonical π(s, a; θ) for Single-Dimensional Continuous Action Spaces
    14.4. REINFORCE ALGORITHM (MONTE-CARLO POLICY GRADIENT)
    14.5. OPTIMAL ASSET ALLOCATION (REVISITED)
    14.6. ACTOR-CRITIC AND VARIANCE REDUCTION
    14.7. OVERCOMING BIAS WITH COMPATIBLE FUNCTION APPROXIMATION
    14.8. POLICY GRADIENT METHODS IN PRACTICE
    14.8.1. Natural Policy Gradient
    14.8.2. Deterministic Policy Gradient
    14.9. EVOLUTIONARY STRATEGIES
    14.10. KEY TAKEAWAYS FROM THIS CHAPTER
    MODULE IV: Finishing Touches
    Chapter 15: Multi-Armed Bandits: Exploration versus Exploitation
    15.1. INTRODUCTION TO THE MULTI-ARMED BANDIT PROBLEM
    15.1.1. Some Examples of Explore-Exploit Dilemma
    15.1.2. Problem Definition
    15.1.3. Regret
    15.1.4. Counts and Gaps
    15.2. SIMPLE ALGORITHMS
    15.2.1. Greedy and ϵ-Greedy
    15.2.2. Optimistic Initialization
    15.2.3. Decaying ϵ_t-Greedy Algorithm
    15.3. LOWER BOUND
    15.4. UPPER CONFIDENCE BOUND ALGORITHMS
    15.4.1. Hoeffding’s Inequality
    15.4.2. UCB1 Algorithm
    15.4.3. Bayesian UCB
    15.5. PROBABILITY MATCHING
    15.5.1. Thompson Sampling
    15.6. GRADIENT BANDITS
    15.7. HORSE RACE
    15.8. INFORMATION STATE SPACE MDP
    15.9. EXTENDING TO CONTEXTUAL BANDITS AND RL CONTROL
    15.10. KEY TAKEAWAYS FROM THIS CHAPTER
    Chapter 16: Blending Learning and Planning
    16.1. PLANNING VERSUS LEARNING
    16.1.1. Planning the Solution of Prediction/Control
    16.1.2. Learning the Solution of Prediction/Control
    16.1.3. Advantages and Disadvantages of Planning versus Learning
    16.1.4. Blending Planning and Learning
    16.2. DECISION-TIME PLANNING
    16.3. MONTE-CARLO TREE-SEARCH (MCTS)
    16.4. ADAPTIVE MULTI-STAGE SAMPLING
    16.5. SUMMARY OF KEY LEARNINGS FROM THIS CHAPTER
    Chapter 17: Summary and Real-World Considerations
    17.1. SUMMARY OF KEY LEARNINGS FROM THIS BOOK
    17.2. RL IN THE REAL-WORLD
    APPENDIX A: Moment Generating Function and Its Applications
    A.1. THE MOMENT GENERATING FUNCTION (MGF)
    A.2. MGF FOR LINEAR FUNCTIONS OF RANDOM VARIABLES
    A.3. MGF FOR THE NORMAL DISTRIBUTION
    A.4. MINIMIZING THE MGF
    A.4.1. Minimizing the MGF When x Follows a Normal Distribution
    A.4.2. Minimizing the MGF When x Follows a Symmetric Binary Distribution
    APPENDIX B: Portfolio Theory
    B.1. SETTING AND NOTATION
    B.2. PORTFOLIO RETURNS
    B.3. DERIVATION OF EFFICIENT FRONTIER CURVE
    B.4. GLOBAL MINIMUM VARIANCE PORTFOLIO (GMVP)
    B.5. ORTHOGONAL EFFICIENT PORTFOLIOS
    B.6. TWO-FUND THEOREM
    B.7. AN EXAMPLE OF THE EFFICIENT FRONTIER FOR 16 ASSETS
    B.8. CAPM: LINEARITY OF COVARIANCE VECTOR W.R.T. MEAN RETURNS
    B.9. USEFUL COROLLARIES OF CAPM
    B.10. CROSS-SECTIONAL VARIANCE
    B.11. EFFICIENT SET WITH A RISK-FREE ASSET
    APPENDIX C: Introduction to and Overview of Stochastic Calculus Basics
    C.1. SIMPLE RANDOM WALK
    C.2. BROWNIAN MOTION AS SCALED RANDOM WALK
    C.3. CONTINUOUS-TIME STOCHASTIC PROCESSES
    C.4. PROPERTIES OF BROWNIAN MOTION SAMPLE TRACES
    C.5. ITO INTEGRAL
    C.6. ITO’S LEMMA
    C.7. A LOGNORMAL PROCESS
    C.8. A MEAN-REVERTING PROCESS
    APPENDIX D: The Hamilton-Jacobi-Bellman (HJB) Equation
    D.1. HJB AS A CONTINUOUS-TIME VERSION OF BELLMAN OPTIMALITY EQUATION
    D.2. HJB WITH STATE TRANSITIONS AS AN ITO PROCESS
    APPENDIX E: Black-Scholes Equation and Its Solution for Call/Put Options
    E.1. ASSUMPTIONS
    E.2. DERIVATION OF THE BLACK-SCHOLES EQUATION
    E.3. SOLUTION OF THE BLACK-SCHOLES EQUATION FOR CALL/PUT OPTIONS
    APPENDIX F: Function Approximations as Affine Spaces
    F.1. VECTOR SPACE
    F.2. FUNCTION SPACE
    F.3. LINEAR MAP OF VECTOR SPACES
    F.4. AFFINE SPACE
    F.5. AFFINE MAP
    F.6. FUNCTION APPROXIMATIONS
    F.6.1. D[R] as an Affine Space P
    F.6.2. Delegator Space R
    F.7. STOCHASTIC GRADIENT DESCENT
    F.8. SGD UPDATE FOR LINEAR FUNCTION APPROXIMATIONS
    APPENDIX G: Conjugate Priors for Gaussian and Bernoulli Distributions
    G.1. CONJUGATE PRIOR FOR GAUSSIAN DISTRIBUTION
    G.2. CONJUGATE PRIOR FOR BERNOULLI DISTRIBUTION
    Bibliography
    Index