{"title":"Leaders and Collaborators: Addressing Sparse Reward Challenges in Multi-Agent Reinforcement Learning","authors":"Shaoqi Sun;Hui Liu;Kele Xu;Bo Ding","doi":"10.1109/TETCI.2024.3488772","DOIUrl":null,"url":null,"abstract":"Cooperative multi-agent reinforcement learning (MARL) has emerged as an effective tool for addressing complex control tasks. However, sparse team rewards present significant challenges for MARL, leading to low exploration efficiency, slow learning speed, and homogenized behaviors among agents. To address these issues, we propose a novel Leader-Collaborator (LC) MARL framework inspired by human social collaboration. The LC framework introduces parallel online knowledge distillation for policy networks (KDPN). KDPN extracts knowledge from two policy networks with different training objectives: one aims to maximize individual rewards, while the other aims to maximize team rewards. The extracted knowledge is utilized to construct team leaders and collaborators. By effectively balancing individual and team rewards, our approach enhances exploration efficiency and promotes behavioral diversity among agents. This addresses the issue of low learning efficiency caused by the lack of objectives early in the agent's learning process and facilitates the development of more effective and differentiated team interaction policies. Additionally, we present the Self-Repairing Strategy (SRS) and Self-Augmenting Strategy (SAS) to facilitate team policies learning while preserving the initial team goal. We evaluate the effectiveness of the LC framework by conducting extensive experiments on the Multi-Agent Particle Environment (MPE), the Google Research Football (GRF), and StarCraft Multi-Agent Challenge (SMAC) with varying levels of difficulty. Our experimental results demonstrate that LC significantly improves the efficiency of the agent's exploration, achieves state-of-the-art performance, and accelerates the learning of the optimal policy. Specifically, in the SMAC scenarios, our method increases the winning rate by 21.9%, increases the average cumulative reward by 12%, and reduces the training time by 57% to achieve optimal performance.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 2","pages":"1976-1989"},"PeriodicalIF":5.3000,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10750496/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Cooperative multi-agent reinforcement learning (MARL) has emerged as an effective tool for addressing complex control tasks. However, sparse team rewards present significant challenges for MARL, leading to low exploration efficiency, slow learning, and homogenized behaviors among agents. To address these issues, we propose a novel Leader-Collaborator (LC) MARL framework inspired by human social collaboration. The LC framework introduces parallel online knowledge distillation for policy networks (KDPN). KDPN extracts knowledge from two policy networks with different training objectives: one aims to maximize individual rewards, while the other aims to maximize team rewards. The extracted knowledge is utilized to construct team leaders and collaborators. By effectively balancing individual and team rewards, our approach enhances exploration efficiency and promotes behavioral diversity among agents. This addresses the low learning efficiency caused by the lack of clear objectives early in the agents' learning process and facilitates the development of more effective and differentiated team interaction policies. Additionally, we present the Self-Repairing Strategy (SRS) and Self-Augmenting Strategy (SAS) to facilitate team policy learning while preserving the initial team goal. We evaluate the effectiveness of the LC framework by conducting extensive experiments on the Multi-Agent Particle Environment (MPE), Google Research Football (GRF), and the StarCraft Multi-Agent Challenge (SMAC) at varying levels of difficulty. Our experimental results demonstrate that LC significantly improves the agents' exploration efficiency, achieves state-of-the-art performance, and accelerates the learning of the optimal policy. Specifically, in the SMAC scenarios, our method increases the winning rate by 21.9%, increases the average cumulative reward by 12%, and reduces the training time required to achieve optimal performance by 57%.
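To make the distillation idea concrete, below is a minimal PyTorch sketch of parallel online knowledge distillation between two policy networks, one trained on an individual-reward objective and one on a team-reward objective. The abstract does not give the exact KDPN loss, so the symmetric-KL coupling, the temperature, the `kd_weight` coefficient, and all names here are illustrative assumptions rather than the paper's formulation.

```python
# Hedged sketch: two policies with different RL objectives are coupled by an
# online distillation term. Not the paper's exact KDPN loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small categorical policy over discrete actions (illustrative)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # action logits


def distillation_loss(logits_a: torch.Tensor, logits_b: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Symmetric KL divergence between the two policies' action distributions."""
    log_p_a = F.log_softmax(logits_a / temperature, dim=-1)
    log_p_b = F.log_softmax(logits_b / temperature, dim=-1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)


def kd_step(individual_net: PolicyNet, team_net: PolicyNet, obs: torch.Tensor,
            individual_pg_loss: torch.Tensor, team_pg_loss: torch.Tensor,
            optimizer: torch.optim.Optimizer, kd_weight: float = 0.1) -> float:
    """One joint update: each network keeps its own policy-gradient loss
    (individual-reward vs. team-reward) plus a shared distillation term."""
    kd = distillation_loss(individual_net(obs), team_net(obs))
    loss = individual_pg_loss + team_pg_loss + kd_weight * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the distillation term lets the individually-motivated and team-motivated policies exchange knowledge during training, which is how the abstract describes constructing leaders and collaborators; how the resulting policies are assigned to agents is a detail of the full paper.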
Journal Introduction:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication. TETCI publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.