Off-policy correction algorithm for double Q network based on deep reinforcement learning

Impact Factor 1.5 · JCR Q3 (Automation & Control Systems)
Qingbo Zhang, Manlu Liu, Heng Wang, Weimin Qian, Xinglang Zhang
Journal: IET Cybersystems and Robotics, vol. 5, no. 4
DOI: 10.1049/csy2.12102
Published: 2023-12-21
Full text: https://onlinelibrary.wiley.com/doi/10.1049/csy2.12102
Citations: 0

Abstract

A deep reinforcement learning (DRL) method based on the deep deterministic policy gradient (DDPG) algorithm is proposed to address three problems that arise during agent training: the mismatch between the samples the current policy needs and the samples actually used for training, the overestimation and underestimation of Q-values, and insufficiently dynamic policy exploration by the agent. The method combines the Actor-Critic Off-Policy Correction (AC-Off-POC) reinforcement learning framework with an improved double Q-value learning scheme, which allows the value-function network in the target task to evaluate the policy network more accurately and to converge to the optimal policy more quickly and stably, yielding higher returns. The method is evaluated on multiple MuJoCo tasks on the OpenAI Gym simulation platform. The experimental results show that it outperforms both the DDPG algorithm based solely on the off-policy correction framework (AC-Off-POC) and conventional DRL algorithms. The returns and stability of the proposed double-Q-network off-policy correction algorithm for the deep deterministic policy gradient (DCAOP-DDPG) are significantly higher than those of the other DRL algorithms.
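The double Q-value idea the abstract refers to can be sketched in a few lines. The snippet below illustrates the general clipped double-Q target (taking the minimum of two target critics, as in TD3) that such methods use to counter overestimation bias; the function name and the exact target rule are illustrative assumptions, not the authors' precise DCAOP-DDPG formulation.

```python
import numpy as np

def double_q_target(rewards, next_q1, next_q2, gamma=0.99, dones=None):
    """TD targets built from the element-wise minimum of two target critics.

    A single critic bootstraps on its own positive errors, inflating
    Q-estimates over time; taking the minimum of two independently
    trained critics damps that overestimation.
    """
    if dones is None:
        dones = np.zeros_like(rewards)
    min_q = np.minimum(next_q1, next_q2)
    return rewards + gamma * (1.0 - dones) * min_q

# Toy batch of two transitions where the critics disagree.
r = np.array([1.0, 0.5])
q1 = np.array([10.0, 4.0])   # critic 1 overestimates the first state
q2 = np.array([8.0, 6.0])    # critic 2 overestimates the second state
targets = double_q_target(r, q1, q2, gamma=0.9)
print(targets)  # [1 + 0.9*8, 0.5 + 0.9*4] = [8.2, 4.1]
```

In each case the inflated estimate is discarded, so the bootstrapped target is biased low rather than high, which is usually the safer direction for value-based control.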


Source journal: IET Cybersystems and Robotics (Computer Science, Information Systems)
CiteScore: 3.70 · Self-citation rate: 0.00% · Articles per year: 31 · Review time: 34 weeks