{"title":"Leaders and Collaborators: Addressing Sparse Reward Challenges in Multi-Agent Reinforcement Learning","authors":"Shaoqi Sun;Hui Liu;Kele Xu;Bo Ding","doi":"10.1109/TETCI.2024.3488772","DOIUrl":null,"url":null,"abstract":"Cooperative multi-agent reinforcement learning (MARL) has emerged as an effective tool for addressing complex control tasks. However, sparse team rewards present significant challenges for MARL, leading to low exploration efficiency, slow learning speed, and homogenized behaviors among agents. To address these issues, we propose a novel Leader-Collaborator (LC) MARL framework inspired by human social collaboration. The LC framework introduces parallel online knowledge distillation for policy networks (KDPN). KDPN extracts knowledge from two policy networks with different training objectives: one aims to maximize individual rewards, while the other aims to maximize team rewards. The extracted knowledge is utilized to construct team leaders and collaborators. By effectively balancing individual and team rewards, our approach enhances exploration efficiency and promotes behavioral diversity among agents. This addresses the issue of low learning efficiency caused by the lack of objectives early in the agent's learning process and facilitates the development of more effective and differentiated team interaction policies. Additionally, we present the Self-Repairing Strategy (SRS) and Self-Augmenting Strategy (SAS) to facilitate team policies learning while preserving the initial team goal. We evaluate the effectiveness of the LC framework by conducting extensive experiments on the Multi-Agent Particle Environment (MPE), the Google Research Football (GRF), and StarCraft Multi-Agent Challenge (SMAC) with varying levels of difficulty. Our experimental results demonstrate that LC significantly improves the efficiency of the agent's exploration, achieves state-of-the-art performance, and accelerates the learning of the optimal policy. Specifically, in the SMAC scenarios, our method increases the winning rate by 21.9%, increases the average cumulative reward by 12%, and reduces the training time by 57% to achieve optimal performance.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 2","pages":"1976-1989"},"PeriodicalIF":5.3000,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10750496/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Cooperative multi-agent reinforcement learning (MARL) has emerged as an effective tool for addressing complex control tasks. However, sparse team rewards present significant challenges for MARL, leading to low exploration efficiency, slow learning, and homogenized behaviors among agents. To address these issues, we propose a novel Leader-Collaborator (LC) MARL framework inspired by human social collaboration. The LC framework introduces parallel online knowledge distillation for policy networks (KDPN). KDPN extracts knowledge from two policy networks with different training objectives: one aims to maximize individual rewards, while the other aims to maximize team rewards. The extracted knowledge is utilized to construct team leaders and collaborators. By effectively balancing individual and team rewards, our approach enhances exploration efficiency and promotes behavioral diversity among agents. This addresses the low learning efficiency caused by the lack of clear objectives early in the agents' learning process and facilitates the development of more effective and differentiated team interaction policies. Additionally, we present the Self-Repairing Strategy (SRS) and Self-Augmenting Strategy (SAS) to facilitate team policy learning while preserving the initial team goal. We evaluate the effectiveness of the LC framework by conducting extensive experiments on the Multi-Agent Particle Environment (MPE), Google Research Football (GRF), and the StarCraft Multi-Agent Challenge (SMAC) at varying levels of difficulty. Our experimental results demonstrate that LC significantly improves the agents' exploration efficiency, achieves state-of-the-art performance, and accelerates the learning of the optimal policy. Specifically, in the SMAC scenarios, our method increases the winning rate by 21.9%, increases the average cumulative reward by 12%, and reduces the training time required to achieve optimal performance by 57%.
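To make the distillation idea concrete, below is a minimal PyTorch sketch of parallel online knowledge distillation between two policy networks, one trained on an individual-reward objective and one on a team-reward objective. The abstract does not give the exact KDPN loss, so the symmetric-KL coupling, the temperature, the `kd_weight` coefficient, and all names here are illustrative assumptions rather than the paper's formulation.

```python
# Hedged sketch: two policies with different RL objectives are coupled by an
# online distillation term. Not the paper's exact KDPN loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small categorical policy over discrete actions (illustrative)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # action logits


def distillation_loss(logits_a: torch.Tensor, logits_b: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Symmetric KL divergence between the two policies' action distributions."""
    log_p_a = F.log_softmax(logits_a / temperature, dim=-1)
    log_p_b = F.log_softmax(logits_b / temperature, dim=-1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)


def kd_step(individual_net: PolicyNet, team_net: PolicyNet, obs: torch.Tensor,
            individual_pg_loss: torch.Tensor, team_pg_loss: torch.Tensor,
            optimizer: torch.optim.Optimizer, kd_weight: float = 0.1) -> float:
    """One joint update: each network keeps its own policy-gradient loss
    (individual-reward vs. team-reward) plus a shared distillation term."""
    kd = distillation_loss(individual_net(obs), team_net(obs))
    loss = individual_pg_loss + team_pg_loss + kd_weight * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the distillation term lets the individually-motivated and team-motivated policies exchange knowledge during training, which is how the abstract describes constructing leaders and collaborators; how the resulting policies are assigned to agents is a detail of the full paper.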
Journal Introduction:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication. TETCI publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.