{"title":"高效的逆向引导演员批评家","authors":"Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao","doi":"10.1109/TG.2024.3453444","DOIUrl":null,"url":null,"abstract":"Exploring procedurally-generated environments presents a formidable challenge in model-free deep reinforcement learning (RL). One state-of-the-art exploration method, adversarially guided actor–critic (AGAC), employs adversarial learning to drive exploration by diversifying the actions of the deep RL agent. Specifically, in the actor–critic (AC) framework, which consists of a policy (the actor) and a value function (the critic), AGAC introduces an adversary that mimics the actor. AGAC then constructs an action-based adversarial advantage (ABAA) to update the actor. This ABAA guides the deep RL agent toward actions that diverge from the adversary's predictions while maximizing expected returns. Although the ABAA drives AGAC to explore procedurally-generated environments, it can affect the balance between exploration and exploitation during the training period, thereby impairing AGAC's performance. To mitigate this adverse effect and improve AGAC's performance, we propose efficient adversarially guided actor–critic (EAGAC). EAGAC introduces a state-based adversarial advantage (SBAA) that directs the deep RL agent toward actions leading to states with different action distributions from those of the adversary while maximizing expected returns. EAGAC combines this SBAA with the ABAA to form a joint adversarial advantage, and then employs this joint adversarial advantage to update the actor. To further reduce this adverse effect and enhance performance, EAGAC stores past positive episodes in the replay buffer and utilizes experiences sampled from this buffer to optimize the actor through self-imitation learning (SIL). The experimental results in procedurally-generated environments from MiniGrid and the 3-D navigation environment from ViZDoom show our EAGAC method significantly outperforms AGAC and other state-of-the-art exploration methods in both sample efficiency and final performance.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"17 2","pages":"346-359"},"PeriodicalIF":2.8000,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Adversarially Guided Actor–Critic\",\"authors\":\"Mao Xu;Shuzhi Sam Ge;Dongjie Zhao;Qian Zhao\",\"doi\":\"10.1109/TG.2024.3453444\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Exploring procedurally-generated environments presents a formidable challenge in model-free deep reinforcement learning (RL). One state-of-the-art exploration method, adversarially guided actor–critic (AGAC), employs adversarial learning to drive exploration by diversifying the actions of the deep RL agent. Specifically, in the actor–critic (AC) framework, which consists of a policy (the actor) and a value function (the critic), AGAC introduces an adversary that mimics the actor. AGAC then constructs an action-based adversarial advantage (ABAA) to update the actor. This ABAA guides the deep RL agent toward actions that diverge from the adversary's predictions while maximizing expected returns. Although the ABAA drives AGAC to explore procedurally-generated environments, it can affect the balance between exploration and exploitation during the training period, thereby impairing AGAC's performance. 
To mitigate this adverse effect and improve AGAC's performance, we propose efficient adversarially guided actor–critic (EAGAC). EAGAC introduces a state-based adversarial advantage (SBAA) that directs the deep RL agent toward actions leading to states with different action distributions from those of the adversary while maximizing expected returns. EAGAC combines this SBAA with the ABAA to form a joint adversarial advantage, and then employs this joint adversarial advantage to update the actor. To further reduce this adverse effect and enhance performance, EAGAC stores past positive episodes in the replay buffer and utilizes experiences sampled from this buffer to optimize the actor through self-imitation learning (SIL). The experimental results in procedurally-generated environments from MiniGrid and the 3-D navigation environment from ViZDoom show our EAGAC method significantly outperforms AGAC and other state-of-the-art exploration methods in both sample efficiency and final performance.\",\"PeriodicalId\":55977,\"journal\":{\"name\":\"IEEE Transactions on Games\",\"volume\":\"17 2\",\"pages\":\"346-359\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Games\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10663959/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10663959/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Exploring procedurally generated environments presents a formidable challenge in model-free deep reinforcement learning (RL). One state-of-the-art exploration method, adversarially guided actor–critic (AGAC), employs adversarial learning to drive exploration by diversifying the actions of the deep RL agent. Specifically, in the actor–critic (AC) framework, which consists of a policy (the actor) and a value function (the critic), AGAC introduces an adversary that mimics the actor. AGAC then constructs an action-based adversarial advantage (ABAA) to update the actor. The ABAA guides the deep RL agent toward actions that diverge from the adversary's predictions while maximizing expected returns. Although the ABAA drives AGAC to explore procedurally generated environments, it can upset the balance between exploration and exploitation during training, thereby impairing AGAC's performance. To mitigate this adverse effect and improve AGAC's performance, we propose efficient adversarially guided actor–critic (EAGAC). EAGAC introduces a state-based adversarial advantage (SBAA) that directs the deep RL agent toward actions leading to states whose action distributions differ from the adversary's, while maximizing expected returns. EAGAC combines the SBAA with the ABAA to form a joint adversarial advantage, and then employs this joint adversarial advantage to update the actor. To further reduce the adverse effect and enhance performance, EAGAC stores past positive episodes in a replay buffer and uses experiences sampled from this buffer to optimize the actor through self-imitation learning (SIL). Experimental results in procedurally generated environments from MiniGrid and the 3-D navigation environment from ViZDoom show that our EAGAC method significantly outperforms AGAC and other state-of-the-art exploration methods in both sample efficiency and final performance.
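The abstract does not give the exact functional forms of the ABAA, the SBAA, or the SIL objective, so the following is a minimal PyTorch sketch of how these ingredients could be wired together, under stated assumptions: the ABAA is taken to be the actor-versus-adversary log-probability ratio used in AGAC, the SBAA is assumed to be a divergence (here, a KL term) between the actor's and adversary's action distributions at the successor state, and the SIL loss follows the standard self-imitation formulation of Oh et al. All function, argument, and coefficient names (joint_adversarial_advantage, c_action, c_state) are illustrative, not the paper's.

```python
# Hypothetical sketch of a joint adversarial advantage (ABAA + SBAA) and an SIL loss.
# Exact forms and coefficients are assumptions; they are not taken from the paper.
import torch
import torch.nn.functional as F


def joint_adversarial_advantage(advantage, actor_logits, adversary_logits,
                                next_actor_logits, next_adversary_logits,
                                actions, c_action=0.4, c_state=0.4):
    """Combine a baseline advantage with action- and state-based adversarial bonuses."""
    log_pi = F.log_softmax(actor_logits, dim=-1)
    log_pi_adv = F.log_softmax(adversary_logits, dim=-1)

    # Action-based adversarial advantage (AGAC-style): log-ratio for the taken action,
    # rewarding actions the adversary assigns low probability to.
    taken = actions.unsqueeze(-1)
    abaa = (log_pi.gather(-1, taken) - log_pi_adv.gather(-1, taken)).squeeze(-1)

    # State-based adversarial advantage (assumed form): divergence between the actor's
    # and adversary's action distributions at the successor state.
    next_pi = F.softmax(next_actor_logits, dim=-1)
    next_log_pi = F.log_softmax(next_actor_logits, dim=-1)
    next_log_pi_adv = F.log_softmax(next_adversary_logits, dim=-1)
    sbaa = (next_pi * (next_log_pi - next_log_pi_adv)).sum(dim=-1)

    # The bonuses are treated as constants when forming the policy-gradient target.
    return advantage + c_action * abaa.detach() + c_state * sbaa.detach()


def self_imitation_loss(actor_logits, values, actions, returns):
    """Standard SIL loss on positive episodes: only transitions whose return exceeds
    the current value estimate contribute (max(R - V, 0) weighting)."""
    log_pi = F.log_softmax(actor_logits, dim=-1)
    log_pi_a = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    gap = (returns - values).clamp(min=0.0)          # max(R - V, 0)
    policy_loss = -(log_pi_a * gap.detach()).mean()  # imitate good past actions
    value_loss = 0.5 * (gap ** 2).mean()             # pull V up toward good returns
    return policy_loss + value_loss
```

In a full agent of this kind, the joint adversarial advantage would replace the standard advantage in the actor's policy-gradient update, and the SIL loss would be computed on minibatches sampled from the buffer of past positive episodes, matching the two mechanisms the abstract describes.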