Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Impact factor: 1.9 · Q1 (Mathematics, Applied)
K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan
{"title":"时间差异学习是最优的吗?依赖实例的分析","authors":"K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan","doi":"10.1137/20m1331524","DOIUrl":null,"url":null,"abstract":"We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\\ell_\\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"1 1","pages":"1013-1040"},"PeriodicalIF":1.9000,"publicationDate":"2020-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":"{\"title\":\"Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis\",\"authors\":\"K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan\",\"doi\":\"10.1137/20m1331524\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\\\\ell_\\\\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.\",\"PeriodicalId\":74797,\"journal\":{\"name\":\"SIAM journal on mathematics of data science\",\"volume\":\"1 1\",\"pages\":\"1013-1040\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2020-03-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"39\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIAM journal on mathematics of data science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1137/20m1331524\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM journal on mathematics of data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1137/20m1331524","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Citations: 39

Abstract

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.
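
For readers unfamiliar with the algorithm under study, the sketch below illustrates TD(0) for policy evaluation under a generative model, combined with Polyak-Ruppert iterate averaging, which is the baseline the paper analyzes. The 3-state MDP (transition matrix P, reward vector r), the step-size schedule, and the iteration count are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Minimal sketch: synchronous TD(0) for policy evaluation under a generative
# model, with Polyak-Ruppert iterate averaging. The MDP below is hypothetical.

rng = np.random.default_rng(0)

gamma = 0.9                                  # discount factor
P = np.array([[0.7, 0.2, 0.1],               # P[s, s'] = transition prob. under the fixed policy
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -1.0])               # expected reward in each state

# Ground-truth value function: theta* = (I - gamma * P)^{-1} r
theta_star = np.linalg.solve(np.eye(3) - gamma * P, r)

n_iters = 50_000
theta = np.zeros(3)                          # current TD iterate
theta_bar = np.zeros(3)                      # Polyak-Ruppert average

for k in range(1, n_iters + 1):
    # Generative model: for every state, sample one next state (reward taken
    # as noiseless here), then apply the synchronous TD(0) update.
    next_states = np.array([rng.choice(3, p=P[s]) for s in range(3)])
    td_target = r + gamma * theta[next_states]
    step = 1.0 / (1.0 + k) ** 0.75           # polynomially decaying step size (illustrative choice)
    theta += step * (td_target - theta)
    theta_bar += (theta - theta_bar) / k     # running average of the iterates

print("l_inf error, last iterate:", np.max(np.abs(theta - theta_star)))
print("l_inf error, PR average:  ", np.max(np.abs(theta_bar - theta_star)))
```

Comparing the two printed ℓ∞ errors against the closed-form solution gives a rough, simulation-level sense of the non-asymptotic, instance-dependent behavior that the paper's lower bounds and variance-reduced algorithms make precise.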