Off-policy correction algorithm for double Q network based on deep reinforcement learning

Impact Factor 1.5 · JCR Q3 (Automation & Control Systems)
Qingbo Zhang, Manlu Liu, Heng Wang, Weimin Qian, Xinglang Zhang
Journal: IET Cybersystems and Robotics, vol. 5, no. 4
DOI: 10.1049/csy2.12102
Published: 2023-12-21
Full text: https://onlinelibrary.wiley.com/doi/10.1049/csy2.12102
Citations: 0

Abstract

A deep reinforcement learning (DRL) method based on the deep deterministic policy gradient (DDPG) algorithm is proposed to address three problems that arise during agent training: the mismatch between the samples the current policy needs and the samples actually used for training, the overestimation and underestimation of Q-values, and insufficiently dynamic policy exploration by the agent. The method combines the Actor-Critic Off-Policy Correction (AC-Off-POC) reinforcement learning framework with an improved double Q-value learning scheme, which allows the value-function network in the target task to evaluate the policy network more accurately and to converge to the optimal policy more quickly and stably, yielding higher returns. The method is evaluated on multiple MuJoCo tasks on the OpenAI Gym simulation platform. The experimental results show that it outperforms both the DDPG algorithm based solely on the off-policy correction framework (AC-Off-POC) and conventional DRL algorithms. The returns and stability of the proposed double-Q-network off-policy correction algorithm for the deep deterministic policy gradient (DCAOP-DDPG) are significantly higher than those of the other DRL algorithms.
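The double Q-value idea the abstract refers to can be sketched in a few lines. The snippet below illustrates the general clipped double-Q target (taking the minimum of two target critics, as in TD3) that such methods use to counter overestimation bias; the function name and the exact target rule are illustrative assumptions, not the authors' precise DCAOP-DDPG formulation.

```python
import numpy as np

def double_q_target(rewards, next_q1, next_q2, gamma=0.99, dones=None):
    """TD targets built from the element-wise minimum of two target critics.

    A single critic bootstraps on its own positive errors, inflating
    Q-estimates over time; taking the minimum of two independently
    trained critics damps that overestimation.
    """
    if dones is None:
        dones = np.zeros_like(rewards)
    min_q = np.minimum(next_q1, next_q2)
    return rewards + gamma * (1.0 - dones) * min_q

# Toy batch of two transitions where the critics disagree.
r = np.array([1.0, 0.5])
q1 = np.array([10.0, 4.0])   # critic 1 overestimates the first state
q2 = np.array([8.0, 6.0])    # critic 2 overestimates the second state
targets = double_q_target(r, q1, q2, gamma=0.9)
print(targets)  # [1 + 0.9*8, 0.5 + 0.9*4] = [8.2, 4.1]
```

In each case the inflated estimate is discarded, so the bootstrapped target is biased low rather than high, which is usually the safer direction for value-based control.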


Source journal: IET Cybersystems and Robotics (Computer Science, Information Systems)
CiteScore: 3.70 · Self-citation rate: 0.00% · Articles per year: 31 · Review time: 34 weeks