Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach

Yanwei Jia, X. Zhou
{"title":"连续时空中的政策评价与时间差学习:一个鞅方法","authors":"Yanwei Jia, X. Zhou","doi":"10.2139/ssrn.3905379","DOIUrl":null,"url":null,"abstract":"We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean--square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a ``martingale loss function\", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the ``martingale orthogonality conditions\" with ``test functions''. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.","PeriodicalId":139983,"journal":{"name":"Econometrics: Econometric & Statistical Methods - Special Topics eJournal","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach\",\"authors\":\"Yanwei Jia, X. Zhou\",\"doi\":\"10.2139/ssrn.3905379\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean--square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a ``martingale loss function\\\", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the ``martingale orthogonality conditions\\\" with ``test functions''. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\\\\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. 
We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.\",\"PeriodicalId\":139983,\"journal\":{\"name\":\"Econometrics: Econometric & Statistical Methods - Special Topics eJournal\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Econometrics: Econometric & Statistical Methods - Special Topics eJournal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2139/ssrn.3905379\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Econometrics: Econometric & Statistical Methods - Special Topics eJournal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2139/ssrn.3905379","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 29

Abstract

We propose a unified framework to study policy evaluation (PE) and the associated temporal-difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods that use the martingale characterization to design PE algorithms. The first minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean-square sense; this method interprets the classical gradient Monte Carlo algorithm. The second is based on a system of equations called the "martingale orthogonality conditions", involving "test functions". Solving these equations in different ways recovers various classical TD algorithms, such as TD(λ), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.
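
To make the two martingale-based approaches concrete, here is a minimal Python (NumPy-only) sketch, not taken from the paper's code. Every modelling choice in it is an illustrative assumption: a one-dimensional Ornstein-Uhlenbeck state process under a fixed policy, a quadratic running reward, zero terminal payoff and no discounting, a linear value-function approximation V_theta(t, x) = theta @ phi(t, x) with hand-picked features, and the gradient dV_theta/dtheta = phi as the test function. Part (i) fits the value function by least squares against the realized reward-to-go, i.e. it minimizes a time-discretized martingale loss over whole trajectories; part (ii) runs an online TD(0)-style stochastic approximation of the martingale orthogonality condition E[∫ ξ_t dM_t] = 0.

import numpy as np

rng = np.random.default_rng(0)

# ---- illustrative setup (all modelling choices are hypothetical) ----
T, dt = 1.0, 0.01                      # finite horizon and time step
n_steps = int(T / dt)
kappa, sigma = 1.0, 0.3                # OU dynamics under the fixed policy: dX = -kappa*X dt + sigma dW

def reward(t, x):
    return -x ** 2                     # assumed running reward; terminal payoff taken to be 0

def features(t, x):
    return np.array([1.0, t, x, x ** 2])   # V_theta(t, x) = theta @ features(t, x)

def simulate_trajectory():
    xs = np.empty(n_steps + 1)
    xs[0] = rng.normal()
    for i in range(n_steps):
        xs[i + 1] = xs[i] - kappa * xs[i] * dt + sigma * np.sqrt(dt) * rng.normal()
    return xs

# ---- (i) martingale loss: least-squares fit of V_theta(t, X_t) to the realized reward-to-go ----
def martingale_loss_fit(n_paths=200):
    A = np.zeros((4, 4))
    b = np.zeros(4)
    for _ in range(n_paths):
        xs = simulate_trajectory()
        rs = np.array([reward(i * dt, xs[i]) for i in range(n_steps)])
        togo = np.concatenate([np.cumsum(rs[::-1])[::-1] * dt, [0.0]])  # int_t^T r ds, plus zero terminal payoff
        for i in range(n_steps + 1):
            phi = features(i * dt, xs[i])
            A += np.outer(phi, phi) * dt
            b += phi * togo[i] * dt
    return np.linalg.solve(A, b)       # normal equations of the discretized martingale loss

# ---- (ii) martingale orthogonality condition with test function xi_t = dV/dtheta ----
def td_orthogonality_fit(n_paths=2000, lr=0.05):
    # Stochastic approximation of E[ int_0^T xi_t dM_t ] = 0, where along a sample path
    # dM_t is approximated by V_theta(t+dt, X_{t+dt}) - V_theta(t, X_t) + r(t, X_t) dt.
    theta = np.zeros(4)
    for _ in range(n_paths):
        xs = simulate_trajectory()
        for i in range(n_steps):
            phi = features(i * dt, xs[i])
            if i + 1 < n_steps:
                v_next = features((i + 1) * dt, xs[i + 1]) @ theta
            else:
                v_next = 0.0           # known terminal value pins down the level of V
            dM = v_next - phi @ theta + reward(i * dt, xs[i]) * dt
            theta += lr * phi * dM     # TD(0)-style update; not a gradient of any loss
    return theta

if __name__ == "__main__":
    print("martingale-loss fit    :", np.round(martingale_loss_fit(), 3))
    print("orthogonality-cond fit :", np.round(td_orthogonality_fit(), 3))

Choosing other test functions ξ_t in part (ii), or solving the resulting linear system in batch rather than by stochastic approximation, would instead give LSTD- or GTD-type schemes, as the abstract indicates; the features, step sizes, and path counts above are arbitrary.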