Efficient Adversarially Guided Actor–Critic

IF 2.8 · CAS Zone 4 (Computer Science) · JCR Q3 (Computer Science, Artificial Intelligence)
Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao
{"title":"高效的逆向引导演员批评家","authors":"Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao","doi":"10.1109/TG.2024.3453444","DOIUrl":null,"url":null,"abstract":"Exploring procedurally-generated environments presents a formidable challenge in model-free deep reinforcement learning (RL). One state-of-the-art exploration method, adversarially guided actor–critic (AGAC), employs adversarial learning to drive exploration by diversifying the actions of the deep RL agent. Specifically, in the actor–critic (AC) framework, which consists of a policy (the actor) and a value function (the critic), AGAC introduces an adversary that mimics the actor. AGAC then constructs an action-based adversarial advantage (ABAA) to update the actor. This ABAA guides the deep RL agent toward actions that diverge from the adversary's predictions while maximizing expected returns. Although the ABAA drives AGAC to explore procedurally-generated environments, it can affect the balance between exploration and exploitation during the training period, thereby impairing AGAC's performance. To mitigate this adverse effect and improve AGAC's performance, we propose efficient adversarially guided actor–critic (EAGAC). EAGAC introduces a state-based adversarial advantage (SBAA) that directs the deep RL agent toward actions leading to states with different action distributions from those of the adversary while maximizing expected returns. EAGAC combines this SBAA with the ABAA to form a joint adversarial advantage, and then employs this joint adversarial advantage to update the actor. To further reduce this adverse effect and enhance performance, EAGAC stores past positive episodes in the replay buffer and utilizes experiences sampled from this buffer to optimize the actor through self-imitation learning (SIL). The experimental results in procedurally-generated environments from MiniGrid and the 3-D navigation environment from ViZDoom show our EAGAC method significantly outperforms AGAC and other state-of-the-art exploration methods in both sample efficiency and final performance.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"17 2","pages":"346-359"},"PeriodicalIF":2.8000,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Adversarially Guided Actor–Critic\",\"authors\":\"Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao\",\"doi\":\"10.1109/TG.2024.3453444\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Exploring procedurally-generated environments presents a formidable challenge in model-free deep reinforcement learning (RL). One state-of-the-art exploration method, adversarially guided actor–critic (AGAC), employs adversarial learning to drive exploration by diversifying the actions of the deep RL agent. Specifically, in the actor–critic (AC) framework, which consists of a policy (the actor) and a value function (the critic), AGAC introduces an adversary that mimics the actor. AGAC then constructs an action-based adversarial advantage (ABAA) to update the actor. This ABAA guides the deep RL agent toward actions that diverge from the adversary's predictions while maximizing expected returns. Although the ABAA drives AGAC to explore procedurally-generated environments, it can affect the balance between exploration and exploitation during the training period, thereby impairing AGAC's performance. 
To mitigate this adverse effect and improve AGAC's performance, we propose efficient adversarially guided actor–critic (EAGAC). EAGAC introduces a state-based adversarial advantage (SBAA) that directs the deep RL agent toward actions leading to states with different action distributions from those of the adversary while maximizing expected returns. EAGAC combines this SBAA with the ABAA to form a joint adversarial advantage, and then employs this joint adversarial advantage to update the actor. To further reduce this adverse effect and enhance performance, EAGAC stores past positive episodes in the replay buffer and utilizes experiences sampled from this buffer to optimize the actor through self-imitation learning (SIL). The experimental results in procedurally-generated environments from MiniGrid and the 3-D navigation environment from ViZDoom show our EAGAC method significantly outperforms AGAC and other state-of-the-art exploration methods in both sample efficiency and final performance.\",\"PeriodicalId\":55977,\"journal\":{\"name\":\"IEEE Transactions on Games\",\"volume\":\"17 2\",\"pages\":\"346-359\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Games\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10663959/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10663959/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Exploring procedurally-generated environments presents a formidable challenge in model-free deep reinforcement learning (RL). One state-of-the-art exploration method, adversarially guided actor–critic (AGAC), employs adversarial learning to drive exploration by diversifying the actions of the deep RL agent. Specifically, in the actor–critic (AC) framework, which consists of a policy (the actor) and a value function (the critic), AGAC introduces an adversary that mimics the actor. AGAC then constructs an action-based adversarial advantage (ABAA) to update the actor. This ABAA guides the deep RL agent toward actions that diverge from the adversary's predictions while maximizing expected returns. Although the ABAA drives AGAC to explore procedurally-generated environments, it can affect the balance between exploration and exploitation during the training period, thereby impairing AGAC's performance. To mitigate this adverse effect and improve AGAC's performance, we propose efficient adversarially guided actor–critic (EAGAC). EAGAC introduces a state-based adversarial advantage (SBAA) that directs the deep RL agent toward actions leading to states with different action distributions from those of the adversary while maximizing expected returns. EAGAC combines this SBAA with the ABAA to form a joint adversarial advantage, and then employs this joint adversarial advantage to update the actor. To further reduce this adverse effect and enhance performance, EAGAC stores past positive episodes in the replay buffer and utilizes experiences sampled from this buffer to optimize the actor through self-imitation learning (SIL). The experimental results in procedurally-generated environments from MiniGrid and the 3-D navigation environment from ViZDoom show our EAGAC method significantly outperforms AGAC and other state-of-the-art exploration methods in both sample efficiency and final performance.
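To make the abstract's description more concrete, the sketch below illustrates, under stated assumptions, how a joint adversarial advantage and a self-imitation term could be assembled in PyTorch. It assumes the ABAA is the log-probability gap between actor and adversary for the taken action (as in AGAC), that the SBAA is a KL divergence between the actor's and adversary's action distributions at the next state, and that the SIL term follows the standard self-imitation loss. The function names, coefficients c_abaa and c_sbaa, and overall structure are illustrative assumptions, not the authors' implementation; consult the paper (DOI: 10.1109/TG.2024.3453444) for the exact formulation.

```python
# Illustrative sketch only: weights, names, and the exact SBAA form are assumptions.
import torch
from torch.distributions import Categorical, kl_divergence


def joint_adversarial_advantage(advantage, actor_logits, adversary_logits,
                                actor_next_logits, adversary_next_logits,
                                actions, c_abaa=0.4, c_sbaa=0.4):
    """Combine the standard advantage with action- and state-based adversarial bonuses."""
    pi = Categorical(logits=actor_logits)
    pi_adv = Categorical(logits=adversary_logits)

    # Action-based adversarial advantage (ABAA, AGAC-style): favour actions whose
    # log-probability under the actor exceeds the adversary's prediction of them.
    abaa_bonus = pi.log_prob(actions) - pi_adv.log_prob(actions)

    # State-based adversarial advantage (SBAA, assumed form): favour transitions into
    # next states where the actor's action distribution diverges from the adversary's.
    sbaa_bonus = kl_divergence(Categorical(logits=actor_next_logits),
                               Categorical(logits=adversary_next_logits))

    return advantage + c_abaa * abaa_bonus + c_sbaa * sbaa_bonus


def actor_loss(actor_logits, actions, joint_advantage):
    """Policy-gradient loss driven by the joint adversarial advantage."""
    log_prob = Categorical(logits=actor_logits).log_prob(actions)
    return -(log_prob * joint_advantage.detach()).mean()


def sil_actor_loss(actor_logits, actions, returns, values):
    """Self-imitation learning term (standard SIL form, assumed here): imitate stored
    positive-episode actions whose return exceeds the current value estimate."""
    log_prob = Categorical(logits=actor_logits).log_prob(actions)
    positive_gap = torch.clamp(returns - values, min=0.0).detach()
    return -(log_prob * positive_gap).mean()
```

In an actual training loop, advantage, returns, and values would come from the critic and the rollout, and the SIL term would be computed only on transitions sampled from the buffer of past positive episodes.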
Source journal: IEEE Transactions on Games (Engineering – Electrical and Electronic Engineering)
CiteScore: 4.60
Self-citation rate: 8.70%
Articles published per year: 87