Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Impact Factor 1.9 · Q1 · Mathematics, Applied
K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan
{"title":"Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis","authors":"K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan","doi":"10.1137/20m1331524","DOIUrl":null,"url":null,"abstract":"We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\\ell_\\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"1 1","pages":"1013-1040"},"PeriodicalIF":1.9000,"publicationDate":"2020-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM journal on mathematics of data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1137/20m1331524","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
引用次数: 39

Abstract

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.
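To make the setting of the abstract concrete, below is a minimal, self-contained sketch (not the authors' implementation) of synchronous TD(0) policy evaluation with Polyak-Ruppert iterate averaging under a generative model, the baseline algorithm whose non-asymptotic suboptimality the paper examines. The synthetic problem instance, the step-size schedule, and the function name td0_policy_evaluation are illustrative assumptions; the paper's variance-reduced stochastic approximation schemes are not reproduced here.

```python
import numpy as np

def td0_policy_evaluation(P, r, gamma, num_iters=10000, seed=0):
    """Synchronous TD(0) with Polyak-Ruppert iterate averaging (illustrative sketch).

    P: (S, S) transition matrix of the Markov chain induced by the evaluated policy.
    r: (S,) expected reward vector.
    gamma: discount factor in [0, 1).
    Returns the averaged iterate as the value-function estimate.
    """
    rng = np.random.default_rng(seed)
    S = len(r)
    theta = np.zeros(S)       # current value-function iterate
    theta_avg = np.zeros(S)   # Polyak-Ruppert average of the iterates

    for k in range(1, num_iters + 1):
        alpha = 1.0 / (1.0 + k)  # decaying step size (illustrative choice)
        # Generative model: draw one next-state sample for every state.
        next_states = np.array([rng.choice(S, p=P[s]) for s in range(S)])
        # TD(0) update toward the sampled Bellman target r(s) + gamma * theta(s').
        td_target = r + gamma * theta[next_states]
        theta += alpha * (td_target - theta)
        # Running average of the iterates (Polyak-Ruppert averaging).
        theta_avg += (theta - theta_avg) / k

    return theta_avg

if __name__ == "__main__":
    # Small synthetic chain; the true value function is (I - gamma * P)^{-1} r,
    # so the ell_infinity error of the estimate can be computed exactly.
    P = np.array([[0.9, 0.1], [0.2, 0.8]])
    r = np.array([1.0, 0.0])
    gamma = 0.9
    v_true = np.linalg.solve(np.eye(2) - gamma * P, r)
    v_hat = td0_policy_evaluation(P, r, gamma)
    print("ell_inf error:", np.max(np.abs(v_hat - v_true)))
```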