{"title":"双足步行器的无偏深确定性策略梯度概念","authors":"Timur Ishuov, Zhenis Otarbay, M. Folgheraiter","doi":"10.1109/SIST54437.2022.9945743","DOIUrl":null,"url":null,"abstract":"After a quick overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) which is based on the Deterministic Policy Gradient (DPG), we put forward a peculiar non-obvious hypothesis that 1) DDPG can be type of on-policy learning and acting algorithm if we consider rewards from mini-batch sample as a relatively stable average reward during a limited time period and a fixed Target Network as fixed actor and critic for the limited time period, and 2) an overestimation in DDPG with the fixed Target Network within specified time may not be an out-of-boundary behavior for low dimensional tasks but a process of reaching regions close to the real Q value's average before converging to better Q values. To empirically show that DDPG with a fixed or stable Target may not exceed Q value limits during training in the OpenAI's Pendulum-v1 Environment, we simplified ideas of Backward Q-learning which combined on-policy and off-policy learning, calling this concept as a unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values or discounted rewards between episodes (hence “unbiased” in the abbreviation). uDDPG is an anchored version of DDPG. We also use simplified Advantage or difference between current Q Network gradient over actions and current simple moving average of this gradient in updating Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. uDDPG version (DDPG-II) with a function “supernaturally” obtained during experiments that damps weaker fluctuations during policy updates showed promising convergence results.","PeriodicalId":207613,"journal":{"name":"2022 International Conference on Smart Information Systems and Technologies (SIST)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Concept of Unbiased Deep Deterministic Policy Gradient for Better Convergence in Bipedal Walker\",\"authors\":\"Timur Ishuov, Zhenis Otarbay, M. Folgheraiter\",\"doi\":\"10.1109/SIST54437.2022.9945743\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"After a quick overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) which is based on the Deterministic Policy Gradient (DPG), we put forward a peculiar non-obvious hypothesis that 1) DDPG can be type of on-policy learning and acting algorithm if we consider rewards from mini-batch sample as a relatively stable average reward during a limited time period and a fixed Target Network as fixed actor and critic for the limited time period, and 2) an overestimation in DDPG with the fixed Target Network within specified time may not be an out-of-boundary behavior for low dimensional tasks but a process of reaching regions close to the real Q value's average before converging to better Q values. To empirically show that DDPG with a fixed or stable Target may not exceed Q value limits during training in the OpenAI's Pendulum-v1 Environment, we simplified ideas of Backward Q-learning which combined on-policy and off-policy learning, calling this concept as a unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. 
In uDDPG we separately train the Target Network on actual Q values or discounted rewards between episodes (hence “unbiased” in the abbreviation). uDDPG is an anchored version of DDPG. We also use simplified Advantage or difference between current Q Network gradient over actions and current simple moving average of this gradient in updating Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. uDDPG version (DDPG-II) with a function “supernaturally” obtained during experiments that damps weaker fluctuations during policy updates showed promising convergence results.\",\"PeriodicalId\":207613,\"journal\":{\"name\":\"2022 International Conference on Smart Information Systems and Technologies (SIST)\",\"volume\":\"44 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Smart Information Systems and Technologies (SIST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIST54437.2022.9945743\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Smart Information Systems and Technologies (SIST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIST54437.2022.9945743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Concept of Unbiased Deep Deterministic Policy Gradient for Better Convergence in Bipedal Walker
After a brief overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) algorithm, which builds on the Deterministic Policy Gradient (DPG), we put forward a non-obvious hypothesis: 1) DDPG can be viewed as an on-policy learning and acting algorithm if the rewards in a mini-batch sample are treated as a relatively stable average reward over a limited time period and a fixed Target Network is treated as a fixed actor and critic for that period; and 2) overestimation in DDPG with a Target Network held fixed over a specified time may not be out-of-bound behavior for low-dimensional tasks, but rather a process of reaching regions close to the average of the real Q value before converging to better Q values. To show empirically that DDPG with a fixed or stable Target Network need not exceed the Q-value limits during training in OpenAI's Pendulum-v1 environment, we simplify the ideas of Backward Q-learning, which combines on-policy and off-policy learning, and call the resulting concept the unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values, i.e., discounted returns, between episodes (hence "unbiased" in the name); uDDPG is thus an anchored version of DDPG. We also use a simplified Advantage, the difference between the current gradient of the Q Network with respect to actions and a simple moving average of this gradient, when updating the Actor Network. Our goal is ultimately to introduce a less biased, more stable version of DDPG. A uDDPG variant (DDPG-II) that includes a function, obtained serendipitously during experiments, which damps weaker fluctuations during policy updates showed promising convergence results.
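To make the two mechanisms described in the abstract concrete, the following is a minimal sketch in PyTorch; it is not the authors' implementation. The network sizes, Pendulum-v1-like dimensions, moving-average window, and learning rates are assumptions. It illustrates (a) fitting the Target critic to actual discounted returns between episodes, the "unbiased" anchor, and (b) an Actor update driven by the difference between the current dQ/da gradient and its simple moving average, the simplified Advantage.

```python
# A minimal sketch (not the authors' code) of the two uDDPG ideas in the abstract;
# network sizes, window length, and learning rates are assumptions.
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99   # Pendulum-v1-like dimensions (assumed)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, s):
        return self.net(s)

def discounted_returns(rewards, gamma=GAMMA):
    """Monte-Carlo returns G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def fit_target_critic(target_critic, states, actions, rewards, optimizer):
    """Between episodes: regress the Target critic on actual discounted returns."""
    targets = torch.tensor(discounted_returns(rewards)).float().unsqueeze(-1)
    loss = nn.functional.mse_loss(target_critic(states, actions), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def actor_step(actor, critic, states, grad_window, actor_optimizer):
    """One Actor update driven by dQ/da minus its simple moving average."""
    actions = actor(states)
    # Per-sample gradient of Q with respect to the Actor's actions.
    dq_da = torch.autograd.grad(critic(states, actions).sum(), actions,
                                retain_graph=True)[0]
    grad_window.append(dq_da.mean(dim=0).detach())       # window of recent gradients
    sma = torch.stack(list(grad_window)).mean(dim=0)     # simple moving average
    advantage_grad = dq_da - sma                         # simplified "Advantage"
    # DPG-style surrogate loss: push actions along the advantage direction.
    actor_loss = -(actions * advantage_grad.detach()).sum(dim=-1).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

# Example wiring (hypothetical choices):
# actor, critic, target_critic = Actor(), Critic(), Critic()
# grad_window = deque(maxlen=10)
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
# target_opt = torch.optim.Adam(target_critic.parameters(), lr=1e-3)
```

In this sketch the anchoring of the Target critic and the gradient-averaging in the Actor update are decoupled; how uDDPG interleaves the two, and the exact damping function used in DDPG-II, are not specified here.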