Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Impact factor: 1.9 · Q1 (Mathematics, Applied)
K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan
{"title":"时间差异学习是最优的吗?依赖实例的分析","authors":"K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan","doi":"10.1137/20m1331524","DOIUrl":null,"url":null,"abstract":"We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\\ell_\\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"1 1","pages":"1013-1040"},"PeriodicalIF":1.9000,"publicationDate":"2020-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":"{\"title\":\"Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis\",\"authors\":\"K. Khamaru, A. Pananjady, Feng Ruan, M. Wainwright, Michael I. Jordan\",\"doi\":\"10.1137/20m1331524\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\\\\ell_\\\\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.\",\"PeriodicalId\":74797,\"journal\":{\"name\":\"SIAM journal on mathematics of data science\",\"volume\":\"1 1\",\"pages\":\"1013-1040\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2020-03-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"39\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIAM journal on mathematics of data science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1137/20m1331524\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM journal on mathematics of data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1137/20m1331524","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Citations: 39

Abstract

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.
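
For readers unfamiliar with the algorithm under study, the sketch below illustrates TD(0) for policy evaluation under a generative model, combined with Polyak-Ruppert iterate averaging, which is the baseline the paper analyzes. The 3-state MDP (transition matrix P, reward vector r), the step-size schedule, and the iteration count are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Minimal sketch: synchronous TD(0) for policy evaluation under a generative
# model, with Polyak-Ruppert iterate averaging. The MDP below is hypothetical.

rng = np.random.default_rng(0)

gamma = 0.9                                  # discount factor
P = np.array([[0.7, 0.2, 0.1],               # P[s, s'] = transition prob. under the fixed policy
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, -1.0])               # expected reward in each state

# Ground-truth value function: theta* = (I - gamma * P)^{-1} r
theta_star = np.linalg.solve(np.eye(3) - gamma * P, r)

n_iters = 50_000
theta = np.zeros(3)                          # current TD iterate
theta_bar = np.zeros(3)                      # Polyak-Ruppert average

for k in range(1, n_iters + 1):
    # Generative model: for every state, sample one next state (reward taken
    # as noiseless here), then apply the synchronous TD(0) update.
    next_states = np.array([rng.choice(3, p=P[s]) for s in range(3)])
    td_target = r + gamma * theta[next_states]
    step = 1.0 / (1.0 + k) ** 0.75           # polynomially decaying step size (illustrative choice)
    theta += step * (td_target - theta)
    theta_bar += (theta - theta_bar) / k     # running average of the iterates

print("l_inf error, last iterate:", np.max(np.abs(theta - theta_star)))
print("l_inf error, PR average:  ", np.max(np.abs(theta_bar - theta_star)))
```

Comparing the two printed ℓ∞ errors against the closed-form solution gives a rough, simulation-level sense of the non-asymptotic, instance-dependent behavior that the paper's lower bounds and variance-reduced algorithms make precise.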