Partial Advantage Estimator for Proximal Policy Optimization
Yizhao Jin; Xiulei Song; Gregory Slabaugh; Simon Lucas
IEEE Transactions on Games, vol. 17, no. 1, pp. 158-166, published 3 June 2024.
DOI: 10.1109/TG.2024.3408298. Available at https://ieeexplore.ieee.org/document/10546313/
In this article, we propose an innovative approach to generalized advantage estimation (GAE) that addresses the bias-variance tradeoff in truncated roll-outs during reinforcement learning. In typical GAE implementations, the advantage at each time step is estimated as a lambda-weighted average of k-step advantage estimates extending to the terminal state. While this method provides consistent bias-variance properties at every time step, it often necessitates truncated roll-outs with shorter horizons to allow faster learning and policy updates within a single episode. This study highlights an unexplored issue: within truncated roll-outs, the bias-variance properties differ between small and large time steps. Specifically, smaller time steps may carry significant bias, prompting a need to increase them. The proposed solution is a partial GAE update: advantage estimates are calculated for all time steps, but the policy is updated only over a specified range. To prevent data waste, the data from this range is retained for further processing and subsequent policy parameter updates. Despite increased memory requirements, this partial GAE approach promises faster computation and better data utilization. Empirical validation was conducted on four MuJoCo tasks and micro real-time strategy (microRTS). The results show a performance improvement trend with the partial GAE estimator, which outperforms regular GAE in task completion speed in microRTS. These findings offer a promising direction for improving policy update efficiency in reinforcement learning.
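For context, the lambda-weighted average referred to above is the standard truncated GAE estimator, built from one-step temporal-difference errors (this formulation follows the original GAE paper, not text from this abstract):

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta_{t+l},

where T is the truncation horizon, so different time steps of the roll-out sum over different numbers of terms, which is the source of the uneven bias-variance properties discussed above.

The sketch below illustrates how a partial GAE update could look under one reading of the abstract: advantages are computed for every step of the truncated roll-out, but only a chosen slice of time steps is passed to the policy update, while the remaining roll-out data stays available for later updates. All names (compute_gae, partial_gae_slice, update_range) and the hyperparameter values are hypothetical; this is a minimal sketch, not the authors' implementation.

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Standard truncated GAE(gamma, lambda) over a roll-out of length T.
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

def partial_gae_slice(advantages, update_range):
    # Select only the time steps used for the current policy update;
    # data outside this slice would be kept in the buffer for later updates.
    start, end = update_range
    mask = np.zeros(len(advantages), dtype=bool)
    mask[start:end] = True
    return advantages[mask], mask

# Toy usage: an 8-step truncated roll-out, updating the policy on steps 2-5 only.
rewards = np.ones(8)
values = np.linspace(1.0, 0.2, 8)
dones = np.zeros(8)
adv = compute_gae(rewards, values, dones, last_value=0.1)
adv_used, used_mask = partial_gae_slice(adv, update_range=(2, 6))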