Counterfactual value decomposition for cooperative multi-agent reinforcement learning

IF 6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Networks. Pub Date: 2025-06-16. DOI: 10.1016/j.neunet.2025.107692

Kai Liu, Tianxian Zhang, Xiangliang Xu, Yuyang Zhao

Volume 190, Article 107692. Full text: https://www.sciencedirect.com/science/article/pii/S0893608025005726

Citations: 0

Abstract

Value decomposition has become a central focus of Multi-Agent Reinforcement Learning (MARL) in recent years. The key challenge lies in constructing and updating the factored value function (FVF). Traditional methods rely on FVFs with restricted representational capacity, which makes them inadequate for tasks with non-monotonic payoffs. Recent approaches address this limitation by designing FVF update mechanisms that extend to non-monotonic scenarios. However, these methods typically depend on the true optimal joint action value to guide FVF updates. Since finding the true optimal joint action is computationally infeasible in practice, these methods approximate it with the greedy joint action and update the FVF with the corresponding greedy joint action value. We observe that although the greedy joint action may be close to the true optimal joint action, its value can be substantially biased relative to the true optimal joint action value. This makes the approximation unreliable and can lead to incorrect update directions for the FVF, hindering learning. To overcome this limitation, we propose Comix, a novel off-policy MARL method based on a Sandwich Value Decomposition Framework. Comix constrains and guides FVF updates using both upper and lower bounds. Specifically, it leverages orthogonal best responses to construct the upper bound, thus avoiding the drawbacks introduced by approximating the optimum. Furthermore, an attention mechanism is incorporated so that the upper bound can be computed with linear time complexity and high accuracy. Theoretical analyses show that Comix satisfies the Individual-Global-Max (IGM) principle. Experiments on the asymmetric One-Step Matrix Game, discrete Predator-Prey, and StarCraft Multi-Agent Challenge show that Comix achieves higher learning efficiency and outperforms several state-of-the-art methods.
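The limitation described in the abstract can be made concrete with a small sketch. The code below is not Comix and is not taken from the paper. It assumes a hypothetical 3x3 cooperative matrix game with a non-monotonic payoff and fits the simplest factored value function, an additive one, by least squares under uniform weighting. The payoff values and the names `PAYOFF`, `additive_fit`, `argmax`, and `analyse` are all illustrative. IGM requires that the per-agent greedy actions together form the joint action that maximizes the joint value, and the sketch checks whether that holds here.

```python
# Illustrative 3x3 cooperative matrix game with a non-monotonic payoff.
# The values are an assumption for this sketch, not the paper's experiment.
PAYOFF = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]


def additive_fit(payoff):
    """Least-squares fit of payoff[i][j] ~ q1[i] + q2[j], uniform weighting.

    For a full table the fitted sum is row mean + column mean - grand mean.
    The grand mean is split evenly between the agents, which is one of many
    equivalent ways to divide the constant term.
    """
    n, m = len(payoff), len(payoff[0])
    grand = sum(sum(row) for row in payoff) / (n * m)
    q1 = [sum(row) / m - grand / 2 for row in payoff]
    q2 = [sum(payoff[i][j] for i in range(n)) / n - grand / 2 for j in range(m)]
    return q1, q2


def argmax(values):
    """Index of the largest entry (first index on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def analyse(payoff):
    """Compare the greedy joint action of the additive fit with the optimum."""
    q1, q2 = additive_fit(payoff)
    greedy = (argmax(q1), argmax(q2))
    cells = [(i, j) for i in range(len(payoff)) for j in range(len(payoff[0]))]
    optimal = max(cells, key=lambda c: payoff[c[0]][c[1]])
    return {
        "greedy_action": greedy,
        "greedy_value": payoff[greedy[0]][greedy[1]],
        "optimal_action": optimal,
        "optimal_value": payoff[optimal[0]][optimal[1]],
    }


result = analyse(PAYOFF)
print(result)
```

In this toy game the additive fit selects the greedy joint action (1, 1) with payoff 0, while the true optimum is (0, 0) with payoff 8. Using the greedy value as a stand-in for the optimal value would therefore give a badly biased update target, which is the kind of failure the abstract attributes to greedy approximation.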


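Source journal: Neural Networks (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 13.90

Self-citation rate: 7.70%

Articles published: 425

Review time: 67 days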

About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.