
Stochastic Optimization Methods for Policy Evaluation in Reinforcement Learning

By Yi Zhou, University of Utah, USA | Shaocong Ma, University of Utah, USA, s.ma@utah.edu

 
Suggested Citation
Yi Zhou and Shaocong Ma (2024), "Stochastic Optimization Methods for Policy Evaluation in Reinforcement Learning", Foundations and Trends® in Optimization: Vol. 6: No. 3, pp 145-192. http://dx.doi.org/10.1561/2400000045

Publication Date: 15 Aug 2024
© 2024 Y. Zhou and S. Ma
 
Subjects
Reinforcement learning, Online learning, Computational complexity, Stochastic optimization
 


Abstract

This monograph introduces various value-based approaches for solving the policy evaluation problem in the online reinforcement learning (RL) scenario, which aims to learn the value function associated with a specific policy under a single Markov decision process (MDP). Approaches vary depending on whether they are implemented in an on-policy or off-policy manner. In on-policy settings, where the policy is evaluated using data generated by the same policy being assessed, classical techniques such as TD(0), TD(λ), and their extensions with function approximation or variance reduction are employed. For off-policy evaluation, where samples are collected under a different behavior policy, this monograph introduces gradient-based two-timescale algorithms such as GTD2, TDC, and variance-reduced TDC, which minimize the mean-squared projected Bellman error (MSPBE) as the objective function. This monograph also discusses their finite-sample convergence upper bounds and sample complexity.
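As context for the off-policy objective named in the abstract (and not a quotation from the monograph itself), the MSPBE under linear function approximation V_θ(s) = φ(s)ᵀθ is commonly written in the gradient-TD literature as

    \mathrm{MSPBE}(\theta)
      = \| V_\theta - \Pi T^{\pi} V_\theta \|_{D}^{2}
      = (b - A\theta)^{\top} C^{-1} (b - A\theta),
    \qquad
    A = \mathbb{E}\!\left[\phi(s)\big(\phi(s) - \gamma \phi(s')\big)^{\top}\right],
    \quad b = \mathbb{E}\!\left[r\,\phi(s)\right],
    \quad C = \mathbb{E}\!\left[\phi(s)\phi(s)^{\top}\right],

where Π is the projection onto the span of the features under the stationary distribution D and T^π is the Bellman operator of the evaluated policy.

For the on-policy setting, the sketch below illustrates a minimal semi-gradient TD(0) update with linear function approximation. The interfaces env.reset, env.step, policy, and features are illustrative assumptions for this sketch, not code from the monograph.

    import numpy as np

    def td0_linear(env, policy, features, dim, gamma=0.99, alpha=0.01, num_steps=10_000):
        """Estimate theta so that V(s) ~= features(s) @ theta for the evaluated policy."""
        theta = np.zeros(dim)
        s = env.reset()
        for _ in range(num_steps):
            a = policy(s)                          # on-policy: act with the policy being evaluated
            s_next, r, done = env.step(a)
            phi, phi_next = features(s), features(s_next)
            # TD error: delta = r + gamma * V(s') - V(s), with V(s') = 0 at termination
            delta = r + (0.0 if done else gamma * phi_next @ theta) - phi @ theta
            theta += alpha * delta * phi           # semi-gradient TD(0) update
            s = env.reset() if done else s_next
        return theta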

DOI: 10.1561/2400000045
ISBN (paperback): 978-1-63828-370-6
60 pp. $55.00
ISBN (e-book, PDF): 978-1-63828-371-3
60 pp. $155.00
Table of contents:
1. Introduction to Reinforcement Learning
2. Introduction to Optimization
3. Introduction to the Policy Evaluation Problem
4. Policy Evaluation with Known Transition Kernel
5. Model-Free Policy Evaluation Algorithms
References


 