Learning Individual Potential-Based Rewards in Multiagent Reinforcement Learning

IF 2.8 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Artificial Intelligence)
Chen Yang, Pei Xu, Junge Zhang
{"title":"Learning Individual Potential-Based Rewards in Multiagent Reinforcement Learning","authors":"Chen Yang;Pei Xu;Junge Zhang","doi":"10.1109/TG.2024.3450475","DOIUrl":null,"url":null,"abstract":"A great challenge for applying multiagent reinforcement learning (MARL) in the field of game artificial intelligence (AI) is to enable agents to learn diversified policies to handle different game-specific problems, while receiving only a shared team reward. At present, a common approach is reward shaping, which focuses on designing rewards for agents to guide cooperation. However, most of the existing methods require prior knowledge on the environment for reward design or alter the optimal policies after imposing extra rewards. Besides, previous MARL methods that rely on manually designed rewards can hardly generalize across different game environments. To this end, we propose a new MARL method that learns individual potential-based rewards for agents. Specifically, we learn a parameterized potential function for each agent to generate individual rewards in the discounted temporal difference form. The whole update procedure is modeled as the bilevel optimization problem, where the lower level is to optimize policies with potential-based rewards, and the upper level is to optimize parameterized potential functions toward maximizing the environment return. We theoretically prove that the individual potential-based rewards can guarantee policy invariance for agents, so that the optimization objective is consistent with the original MARL problem. We evaluate our method with a number of existing state-of-the-art MARL methods on predator–prey and <italic>StarCraft II</i> game environments. Empirical results show that our proposed method significantly outperforms baseline methods and achieves better game AI that enjoys high performance and generalization.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"17 2","pages":"334-345"},"PeriodicalIF":2.8000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10659352/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

A great challenge in applying multiagent reinforcement learning (MARL) to game artificial intelligence (AI) is to enable agents to learn diversified policies that handle different game-specific problems while receiving only a shared team reward. A common approach is reward shaping, which designs rewards for individual agents to guide cooperation. However, most existing methods require prior knowledge of the environment for reward design or alter the optimal policies once extra rewards are imposed. Moreover, previous MARL methods that rely on manually designed rewards generalize poorly across different game environments. To this end, we propose a new MARL method that learns individual potential-based rewards for agents. Specifically, we learn a parameterized potential function for each agent to generate individual rewards in the discounted temporal-difference form. The whole update procedure is modeled as a bilevel optimization problem, where the lower level optimizes policies with the potential-based rewards, and the upper level optimizes the parameterized potential functions toward maximizing the environment return. We theoretically prove that the individual potential-based rewards guarantee policy invariance for agents, so the optimization objective remains consistent with the original MARL problem. We evaluate our method against a number of existing state-of-the-art MARL methods on predator–prey and StarCraft II game environments. Empirical results show that our proposed method significantly outperforms baseline methods and yields game AI with better performance and generalization.
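The core idea described in the abstract, shaping each agent's reward with a learned potential function in the discounted temporal-difference form, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names PotentialNet and shaped_reward, the network sizes, and the toy usage are assumptions made here for clarity.

# Minimal sketch of individual potential-based reward shaping in the
# discounted temporal-difference form described above. PotentialNet,
# shaped_reward, and the toy usage below are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

GAMMA = 0.99

class PotentialNet(nn.Module):
    """Parameterized potential function phi_i(s) for one agent."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def shaped_reward(team_reward: float, phi: PotentialNet,
                  obs: torch.Tensor, next_obs: torch.Tensor,
                  done: float) -> torch.Tensor:
    """Individual reward: shared team reward plus gamma*phi(s') - phi(s).
    Potential-based terms of this form preserve the optimal policies."""
    with torch.no_grad():  # the lower level treats the potentials as fixed
        f = GAMMA * phi(next_obs) * (1.0 - done) - phi(obs)
    return team_reward + f

# Toy usage: three agents with 8-dimensional observations.
n_agents, obs_dim = 3, 8
potentials = [PotentialNet(obs_dim) for _ in range(n_agents)]
obs, next_obs = torch.randn(n_agents, obs_dim), torch.randn(n_agents, obs_dim)
team_reward, done = 1.0, 0.0

individual_rewards = [
    shaped_reward(team_reward, potentials[i], obs[i], next_obs[i], done)
    for i in range(n_agents)
]

# Bilevel structure (sketched): the lower level trains each agent's policy
# with any MARL algorithm on its individual_rewards; the upper level then
# updates the PotentialNet parameters so that the resulting policies
# maximize the true environment return (e.g., via meta-gradients).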
Source Journal
IEEE Transactions on Games (Engineering - Electrical and Electronic Engineering)
CiteScore: 4.60 | Self-citation rate: 8.70% | Annual articles: 87