Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach

Yanwei Jia, X. Zhou
{"title":"连续时空中的政策评价与时间差学习:一个鞅方法","authors":"Yanwei Jia, X. Zhou","doi":"10.2139/ssrn.3905379","DOIUrl":null,"url":null,"abstract":"We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean--square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a ``martingale loss function\", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the ``martingale orthogonality conditions\" with ``test functions''. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.","PeriodicalId":139983,"journal":{"name":"Econometrics: Econometric & Statistical Methods - Special Topics eJournal","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach\",\"authors\":\"Yanwei Jia, X. Zhou\",\"doi\":\"10.2139/ssrn.3905379\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean--square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a ``martingale loss function\\\", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the ``martingale orthogonality conditions\\\" with ``test functions''. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\\\\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. 
We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.\",\"PeriodicalId\":139983,\"journal\":{\"name\":\"Econometrics: Econometric & Statistical Methods - Special Topics eJournal\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Econometrics: Econometric & Statistical Methods - Special Topics eJournal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2139/ssrn.3905379\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Econometrics: Econometric & Statistical Methods - Special Topics eJournal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2139/ssrn.3905379","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 29

Abstract

We propose a unified framework to study policy evaluation (PE) and the associated temporal-difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods that use the martingale characterization to design PE algorithms. The first minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean-square sense; this method interprets the classical gradient Monte Carlo algorithm. The second is based on a system of equations called the "martingale orthogonality conditions", involving "test functions". Solving these equations in different ways recovers various classical TD algorithms, such as TD(λ), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.
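
To make the two martingale-based approaches concrete, here is a minimal Python (NumPy-only) sketch, not taken from the paper's code. Every modelling choice in it is an illustrative assumption: a one-dimensional Ornstein-Uhlenbeck state process under a fixed policy, a quadratic running reward, zero terminal payoff and no discounting, a linear value-function approximation V_theta(t, x) = theta @ phi(t, x) with hand-picked features, and the gradient dV_theta/dtheta = phi as the test function. Part (i) fits the value function by least squares against the realized reward-to-go, i.e. it minimizes a time-discretized martingale loss over whole trajectories; part (ii) runs an online TD(0)-style stochastic approximation of the martingale orthogonality condition E[∫ ξ_t dM_t] = 0.

import numpy as np

rng = np.random.default_rng(0)

# ---- illustrative setup (all modelling choices are hypothetical) ----
T, dt = 1.0, 0.01                      # finite horizon and time step
n_steps = int(T / dt)
kappa, sigma = 1.0, 0.3                # OU dynamics under the fixed policy: dX = -kappa*X dt + sigma dW

def reward(t, x):
    return -x ** 2                     # assumed running reward; terminal payoff taken to be 0

def features(t, x):
    return np.array([1.0, t, x, x ** 2])   # V_theta(t, x) = theta @ features(t, x)

def simulate_trajectory():
    xs = np.empty(n_steps + 1)
    xs[0] = rng.normal()
    for i in range(n_steps):
        xs[i + 1] = xs[i] - kappa * xs[i] * dt + sigma * np.sqrt(dt) * rng.normal()
    return xs

# ---- (i) martingale loss: least-squares fit of V_theta(t, X_t) to the realized reward-to-go ----
def martingale_loss_fit(n_paths=200):
    A = np.zeros((4, 4))
    b = np.zeros(4)
    for _ in range(n_paths):
        xs = simulate_trajectory()
        rs = np.array([reward(i * dt, xs[i]) for i in range(n_steps)])
        togo = np.concatenate([np.cumsum(rs[::-1])[::-1] * dt, [0.0]])  # int_t^T r ds, plus zero terminal payoff
        for i in range(n_steps + 1):
            phi = features(i * dt, xs[i])
            A += np.outer(phi, phi) * dt
            b += phi * togo[i] * dt
    return np.linalg.solve(A, b)       # normal equations of the discretized martingale loss

# ---- (ii) martingale orthogonality condition with test function xi_t = dV/dtheta ----
def td_orthogonality_fit(n_paths=2000, lr=0.05):
    # Stochastic approximation of E[ int_0^T xi_t dM_t ] = 0, where along a sample path
    # dM_t is approximated by V_theta(t+dt, X_{t+dt}) - V_theta(t, X_t) + r(t, X_t) dt.
    theta = np.zeros(4)
    for _ in range(n_paths):
        xs = simulate_trajectory()
        for i in range(n_steps):
            phi = features(i * dt, xs[i])
            if i + 1 < n_steps:
                v_next = features((i + 1) * dt, xs[i + 1]) @ theta
            else:
                v_next = 0.0           # known terminal value pins down the level of V
            dM = v_next - phi @ theta + reward(i * dt, xs[i]) * dt
            theta += lr * phi * dM     # TD(0)-style update; not a gradient of any loss
    return theta

if __name__ == "__main__":
    print("martingale-loss fit    :", np.round(martingale_loss_fit(), 3))
    print("orthogonality-cond fit :", np.round(td_orthogonality_fit(), 3))

Choosing other test functions ξ_t in part (ii), or solving the resulting linear system in batch rather than by stochastic approximation, would instead give LSTD- or GTD-type schemes, as the abstract indicates; the features, step sizes, and path counts above are arbitrary.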