Partial Advantage Estimator for Proximal Policy Optimization
Yizhao Jin; Xiulei Song; Gregory Slabaugh; Simon Lucas
IEEE Transactions on Games, vol. 17, no. 1, pp. 158-166, published 3 June 2024.
DOI: 10.1109/TG.2024.3408298. Available at https://ieeexplore.ieee.org/document/10546313/
In this article, we propose an innovative approach to generalized advantage estimation (GAE) that addresses the bias-variance tradeoff in truncated roll-outs during reinforcement learning. In typical GAE implementations, the advantage at each time step is estimated as a lambda-weighted average of k-step advantage estimates extending to the terminal state. While this method provides consistent bias-variance properties at every time step, it often necessitates truncated roll-outs with shorter horizons to allow faster learning and policy updates within a single episode. This study highlights an unexplored issue: within truncated roll-outs, the bias-variance properties differ between small and large time steps. Specifically, smaller time steps may carry significant bias, prompting a need to increase them. The proposed solution is a partial GAE update: advantage estimates are calculated for all time steps, but the policy is updated only over a specified range. To prevent data waste, the data from this range is retained for further processing and subsequent policy parameter updates. Despite increased memory requirements, this partial GAE approach promises faster computation and better data utilization. Empirical validation was conducted on four MuJoCo tasks and micro real-time strategy (microRTS). The results show a performance improvement trend with the partial GAE estimator, which outperforms regular GAE in task completion speed in microRTS. These findings offer a promising direction for improving policy update efficiency in reinforcement learning.
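For context, the lambda-weighted average referred to above is the standard truncated GAE estimator, built from one-step temporal-difference errors (this formulation follows the original GAE paper, not text from this abstract):

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta_{t+l},

where T is the truncation horizon, so different time steps of the roll-out sum over different numbers of terms, which is the source of the uneven bias-variance properties discussed above.

The sketch below illustrates how a partial GAE update could look under one reading of the abstract: advantages are computed for every step of the truncated roll-out, but only a chosen slice of time steps is passed to the policy update, while the remaining roll-out data stays available for later updates. All names (compute_gae, partial_gae_slice, update_range) and the hyperparameter values are hypothetical; this is a minimal sketch, not the authors' implementation.

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Standard truncated GAE(gamma, lambda) over a roll-out of length T.
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

def partial_gae_slice(advantages, update_range):
    # Select only the time steps used for the current policy update;
    # data outside this slice would be kept in the buffer for later updates.
    start, end = update_range
    mask = np.zeros(len(advantages), dtype=bool)
    mask[start:end] = True
    return advantages[mask], mask

# Toy usage: an 8-step truncated roll-out, updating the policy on steps 2-5 only.
rewards = np.ones(8)
values = np.linspace(1.0, 0.2, 8)
dones = np.zeros(8)
adv = compute_gae(rewards, values, dones, last_value=0.1)
adv_used, used_mask = partial_gae_slice(adv, update_range=(2, 6))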