{"title":"双足步行器的无偏深确定性策略梯度概念","authors":"Timur Ishuov, Zhenis Otarbay, M. Folgheraiter","doi":"10.1109/SIST54437.2022.9945743","DOIUrl":null,"url":null,"abstract":"After a quick overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) which is based on the Deterministic Policy Gradient (DPG), we put forward a peculiar non-obvious hypothesis that 1) DDPG can be type of on-policy learning and acting algorithm if we consider rewards from mini-batch sample as a relatively stable average reward during a limited time period and a fixed Target Network as fixed actor and critic for the limited time period, and 2) an overestimation in DDPG with the fixed Target Network within specified time may not be an out-of-boundary behavior for low dimensional tasks but a process of reaching regions close to the real Q value's average before converging to better Q values. To empirically show that DDPG with a fixed or stable Target may not exceed Q value limits during training in the OpenAI's Pendulum-v1 Environment, we simplified ideas of Backward Q-learning which combined on-policy and off-policy learning, calling this concept as a unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values or discounted rewards between episodes (hence “unbiased” in the abbreviation). uDDPG is an anchored version of DDPG. We also use simplified Advantage or difference between current Q Network gradient over actions and current simple moving average of this gradient in updating Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. uDDPG version (DDPG-II) with a function “supernaturally” obtained during experiments that damps weaker fluctuations during policy updates showed promising convergence results.","PeriodicalId":207613,"journal":{"name":"2022 International Conference on Smart Information Systems and Technologies (SIST)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Concept of Unbiased Deep Deterministic Policy Gradient for Better Convergence in Bipedal Walker\",\"authors\":\"Timur Ishuov, Zhenis Otarbay, M. Folgheraiter\",\"doi\":\"10.1109/SIST54437.2022.9945743\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"After a quick overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) which is based on the Deterministic Policy Gradient (DPG), we put forward a peculiar non-obvious hypothesis that 1) DDPG can be type of on-policy learning and acting algorithm if we consider rewards from mini-batch sample as a relatively stable average reward during a limited time period and a fixed Target Network as fixed actor and critic for the limited time period, and 2) an overestimation in DDPG with the fixed Target Network within specified time may not be an out-of-boundary behavior for low dimensional tasks but a process of reaching regions close to the real Q value's average before converging to better Q values. To empirically show that DDPG with a fixed or stable Target may not exceed Q value limits during training in the OpenAI's Pendulum-v1 Environment, we simplified ideas of Backward Q-learning which combined on-policy and off-policy learning, calling this concept as a unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. 
In uDDPG we separately train the Target Network on actual Q values or discounted rewards between episodes (hence “unbiased” in the abbreviation). uDDPG is an anchored version of DDPG. We also use simplified Advantage or difference between current Q Network gradient over actions and current simple moving average of this gradient in updating Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. uDDPG version (DDPG-II) with a function “supernaturally” obtained during experiments that damps weaker fluctuations during policy updates showed promising convergence results.\",\"PeriodicalId\":207613,\"journal\":{\"name\":\"2022 International Conference on Smart Information Systems and Technologies (SIST)\",\"volume\":\"44 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Smart Information Systems and Technologies (SIST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIST54437.2022.9945743\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Smart Information Systems and Technologies (SIST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIST54437.2022.9945743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Concept of Unbiased Deep Deterministic Policy Gradient for Better Convergence in Bipedal Walker
After a brief overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) algorithm, which builds on the Deterministic Policy Gradient (DPG), we put forward a non-obvious hypothesis: 1) DDPG can be viewed as an on-policy learning and acting algorithm if the rewards in a mini-batch sample are treated as a relatively stable average reward over a limited time period and a fixed Target Network is treated as a fixed actor and critic for that period; and 2) overestimation in DDPG with a Target Network held fixed over a specified time may not be out-of-bound behavior for low-dimensional tasks, but rather a process of reaching regions close to the average of the real Q value before converging to better Q values. To show empirically that DDPG with a fixed or stable Target Network need not exceed the Q-value limits during training in OpenAI's Pendulum-v1 environment, we simplify the ideas of Backward Q-learning, which combines on-policy and off-policy learning, and call the resulting concept the unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values, i.e., discounted returns, between episodes (hence "unbiased" in the name); uDDPG is thus an anchored version of DDPG. We also use a simplified Advantage, the difference between the current gradient of the Q Network with respect to actions and a simple moving average of this gradient, when updating the Actor Network. Our goal is ultimately to introduce a less biased, more stable version of DDPG. A uDDPG variant (DDPG-II) that includes a function, obtained serendipitously during experiments, which damps weaker fluctuations during policy updates showed promising convergence results.
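To make the two mechanisms described in the abstract concrete, the following is a minimal sketch in PyTorch; it is not the authors' implementation. The network sizes, Pendulum-v1-like dimensions, moving-average window, and learning rates are assumptions. It illustrates (a) fitting the Target critic to actual discounted returns between episodes, the "unbiased" anchor, and (b) an Actor update driven by the difference between the current dQ/da gradient and its simple moving average, the simplified Advantage.

```python
# A minimal sketch (not the authors' code) of the two uDDPG ideas in the abstract;
# network sizes, window length, and learning rates are assumptions.
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99   # Pendulum-v1-like dimensions (assumed)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, s):
        return self.net(s)

def discounted_returns(rewards, gamma=GAMMA):
    """Monte-Carlo returns G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def fit_target_critic(target_critic, states, actions, rewards, optimizer):
    """Between episodes: regress the Target critic on actual discounted returns."""
    targets = torch.tensor(discounted_returns(rewards)).float().unsqueeze(-1)
    loss = nn.functional.mse_loss(target_critic(states, actions), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def actor_step(actor, critic, states, grad_window, actor_optimizer):
    """One Actor update driven by dQ/da minus its simple moving average."""
    actions = actor(states)
    # Per-sample gradient of Q with respect to the Actor's actions.
    dq_da = torch.autograd.grad(critic(states, actions).sum(), actions,
                                retain_graph=True)[0]
    grad_window.append(dq_da.mean(dim=0).detach())       # window of recent gradients
    sma = torch.stack(list(grad_window)).mean(dim=0)     # simple moving average
    advantage_grad = dq_da - sma                         # simplified "Advantage"
    # DPG-style surrogate loss: push actions along the advantage direction.
    actor_loss = -(actions * advantage_grad.detach()).sum(dim=-1).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

# Example wiring (hypothetical choices):
# actor, critic, target_critic = Actor(), Critic(), Critic()
# grad_window = deque(maxlen=10)
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
# target_opt = torch.optim.Adam(target_critic.parameters(), lr=1e-3)
```

In this sketch the anchoring of the Target critic and the gradient-averaging in the Actor update are decoupled; how uDDPG interleaves the two, and the exact damping function used in DDPG-II, are not specified here.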