Meta Proximal Policy Optimization for Cooperative Multi-Agent Continuous Control

2022 International Joint Conference on Neural Networks (IJCNN). Pub Date: 2022-07-18. DOI: 10.1109/IJCNN55064.2022.9892004

Boli Fang, Zhenghao Peng, Hao Sun, Qin Zhang

Citations: 0

Abstract

In this paper we propose Multi-Agent Proxy Proximal Policy Optimization (MA3PO), a novel multi-agent deep reinforcement learning algorithm for cooperative continuous multi-agent control. Our method is motivated by the observation that most existing multi-agent reinforcement learning algorithms focus on discrete state/action spaces and therefore become computationally infeasible when extended to environments with continuous state/action spaces. To address this computational cost and to better model inter-agent collaboration, we build on Proximal Policy Optimization (PPO), a recently successful algorithm that explores continuous action spaces effectively, and incorporate intrinsic motivation via meta-gradient methods to stimulate the behavior of individual agents in cooperative multi-agent settings. To this end, we design proxy rewards that quantify the effect of each agent's intrinsic motivation on the team-level reward, and apply meta-gradient methods to exploit this additional signal so that our algorithm learns to maximize the team-level cumulative reward effectively. Experiments on several multi-agent reinforcement learning benchmark environments with continuous action spaces show that our algorithm not only performs comparably to existing state-of-the-art baselines but also significantly reduces training time complexity.
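The abstract does not state MA3PO's equations, so the following is only a hedged illustration of the two ingredients it names: a per-agent "proxy" reward and the standard PPO clipped surrogate objective for a continuous (Gaussian) policy. The additive form `team + eta * intrinsic`, the weight `eta`, and every function name here are hypothetical assumptions, not the authors' definitions; the meta-gradient step that would tune the intrinsic component against the team-level return is not shown.

```python
import math
from typing import List, Sequence


def proxy_rewards(team_rewards: Sequence[float],
                  intrinsic_rewards: Sequence[float],
                  eta: float) -> List[float]:
    """Hypothetical proxy reward: shared team reward plus a weighted
    agent-specific intrinsic reward (an assumed additive form)."""
    return [r + eta * ri for r, ri in zip(team_rewards, intrinsic_rewards)]


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Reward-to-go G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def gaussian_log_prob(action: float, mean: float, std: float) -> float:
    """Log-density of a 1-D Gaussian policy, a common choice for continuous actions."""
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def ppo_clip_objective(new_logp: Sequence[float],
                       old_logp: Sequence[float],
                       advantages: Sequence[float],
                       clip_eps: float = 0.2) -> float:
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A); PPO maximizes this."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Toy 3-step trajectory for a single agent (illustrative numbers only).
    team = [1.0, 0.0, 1.0]
    intrinsic = [0.5, 0.5, 0.0]
    rewards = proxy_rewards(team, intrinsic, eta=0.1)
    returns = discounted_returns(rewards, gamma=0.99)
    baseline = sum(returns) / len(returns)  # crude mean baseline
    advantages = [g - baseline for g in returns]
    actions = [0.3, -0.1, 0.2]
    old_logp = [gaussian_log_prob(a, 0.0, 1.0) for a in actions]
    new_logp = [gaussian_log_prob(a, 0.1, 1.0) for a in actions]
    print(ppo_clip_objective(new_logp, old_logp, advantages))
```

Clipping the probability ratio to `[1 - eps, 1 + eps]` removes the incentive to move the policy far from the one that collected the data in a single update, which is what makes PPO stable on continuous action spaces; how MA3PO weights or meta-learns the intrinsic term would replace the fixed `eta` above.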
