Decentralized multi-agent reinforcement learning based on best-response policies

Volker Gabler, Dirk Wollherr
{"title":"基于最佳响应策略的分散式多代理强化学习","authors":"Volker Gabler, Dirk Wollherr","doi":"10.3389/frobt.2024.1229026","DOIUrl":null,"url":null,"abstract":"Introduction: Multi-agent systems are an interdisciplinary research field that describes the concept of multiple decisive individuals interacting with a usually partially observable environment. Given the recent advances in single-agent reinforcement learning, multi-agent reinforcement learning (RL) has gained tremendous interest in recent years. Most research studies apply a fully centralized learning scheme to ease the transfer from the single-agent domain to multi-agent systems.Methods: In contrast, we claim that a decentralized learning scheme is preferable for applications in real-world scenarios as this allows deploying a learning algorithm on an individual robot rather than deploying the algorithm to a complete fleet of robots. Therefore, this article outlines a novel actor–critic (AC) approach tailored to cooperative MARL problems in sparsely rewarded domains. Our approach decouples the MARL problem into a set of distributed agents that model the other agents as responsive entities. In particular, we propose using two separate critics per agent to distinguish between the joint task reward and agent-based costs as commonly applied within multi-robot planning. On one hand, the agent-based critic intends to decrease agent-specific costs. On the other hand, each agent intends to optimize the joint team reward based on the joint task critic. As this critic still depends on the joint action of all agents, we outline two suitable behavior models based on Stackelberg games: a game against nature and a dyadic game against each agent. Following these behavior models, our algorithm allows fully decentralized execution and training.Results and Discussion: We evaluate our presented method using the proposed behavior models within a sparsely rewarded simulated multi-agent environment. Although our approach already outperforms the state-of-the-art learners, we conclude this article by outlining possible extensions of our algorithm that future research may build upon.","PeriodicalId":504612,"journal":{"name":"Frontiers in Robotics and AI","volume":"2 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Decentralized multi-agent reinforcement learning based on best-response policies\",\"authors\":\"Volker Gabler, Dirk Wollherr\",\"doi\":\"10.3389/frobt.2024.1229026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Introduction: Multi-agent systems are an interdisciplinary research field that describes the concept of multiple decisive individuals interacting with a usually partially observable environment. Given the recent advances in single-agent reinforcement learning, multi-agent reinforcement learning (RL) has gained tremendous interest in recent years. Most research studies apply a fully centralized learning scheme to ease the transfer from the single-agent domain to multi-agent systems.Methods: In contrast, we claim that a decentralized learning scheme is preferable for applications in real-world scenarios as this allows deploying a learning algorithm on an individual robot rather than deploying the algorithm to a complete fleet of robots. Therefore, this article outlines a novel actor–critic (AC) approach tailored to cooperative MARL problems in sparsely rewarded domains. 
Our approach decouples the MARL problem into a set of distributed agents that model the other agents as responsive entities. In particular, we propose using two separate critics per agent to distinguish between the joint task reward and agent-based costs as commonly applied within multi-robot planning. On one hand, the agent-based critic intends to decrease agent-specific costs. On the other hand, each agent intends to optimize the joint team reward based on the joint task critic. As this critic still depends on the joint action of all agents, we outline two suitable behavior models based on Stackelberg games: a game against nature and a dyadic game against each agent. Following these behavior models, our algorithm allows fully decentralized execution and training.Results and Discussion: We evaluate our presented method using the proposed behavior models within a sparsely rewarded simulated multi-agent environment. Although our approach already outperforms the state-of-the-art learners, we conclude this article by outlining possible extensions of our algorithm that future research may build upon.\",\"PeriodicalId\":504612,\"journal\":{\"name\":\"Frontiers in Robotics and AI\",\"volume\":\"2 2\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in Robotics and AI\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/frobt.2024.1229026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Robotics and AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frobt.2024.1229026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction: Multi-agent systems are an interdisciplinary research field that describes the concept of multiple decisive individuals interacting with a usually partially observable environment. Given the recent advances in single-agent reinforcement learning, multi-agent reinforcement learning (MARL) has gained tremendous interest in recent years. Most research studies apply a fully centralized learning scheme to ease the transfer from the single-agent domain to multi-agent systems.

Methods: In contrast, we claim that a decentralized learning scheme is preferable for real-world applications, as it allows deploying a learning algorithm on an individual robot rather than on a complete fleet of robots. Therefore, this article outlines a novel actor–critic (AC) approach tailored to cooperative MARL problems in sparsely rewarded domains. Our approach decouples the MARL problem into a set of distributed agents that model the other agents as responsive entities. In particular, we propose using two separate critics per agent to distinguish between the joint task reward and agent-specific costs, as commonly applied within multi-robot planning. On the one hand, the agent-based critic aims to decrease agent-specific costs; on the other hand, each agent optimizes the joint team reward based on the joint task critic. As this critic still depends on the joint action of all agents, we outline two suitable behavior models based on Stackelberg games: a game against nature and a dyadic game against each agent. Following these behavior models, our algorithm allows fully decentralized execution and training.

Results and Discussion: We evaluate the presented method with the proposed behavior models in a sparsely rewarded simulated multi-agent environment. Although our approach already outperforms state-of-the-art learners, we conclude the article by outlining possible extensions of our algorithm that future research may build upon.
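To make the two-critic structure concrete, the sketch below shows one way a decentralized agent of this kind could be organized. It is not the authors' implementation: the class name, network sizes, the `cost_weight` trade-off, and the fixed `other_actions_model` standing in for the best-response behavior of the remaining agents are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): one decentralized agent
# with a local actor, a joint-task critic over the joint action, and an
# agent-specific cost critic over its own action only.
import torch
import torch.nn as nn


class DecentralizedACAgent(nn.Module):
    def __init__(self, obs_dim, act_dim, joint_act_dim, hidden=64):
        super().__init__()
        # Policy conditioned only on the agent's local observation.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )
        # Critic 1: joint task reward, depends on the joint action of all agents.
        self.task_critic = nn.Sequential(
            nn.Linear(obs_dim + joint_act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        # Critic 2: agent-specific costs, depends only on the agent's own action.
        self.cost_critic = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def act(self, obs):
        return self.actor(obs)

    def actor_loss(self, obs, other_actions_model, cost_weight=0.1):
        """Best-response-style actor update: the other agents are replaced by a
        fixed response model (hypothetical placeholder, e.g. their last known
        policies), so the joint-task critic can be evaluated locally."""
        own_action = self.actor(obs)
        # Predicted actions of the remaining agents (dim = joint_act_dim - act_dim).
        predicted_others = other_actions_model(obs).detach()
        joint_action = torch.cat([own_action, predicted_others], dim=-1)
        task_value = self.task_critic(torch.cat([obs, joint_action], dim=-1))
        cost_value = self.cost_critic(torch.cat([obs, own_action], dim=-1))
        # Maximize the joint team reward while reducing agent-specific cost.
        return -(task_value - cost_weight * cost_value).mean()
```

Under these assumptions, each robot trains its own copy of such an agent, which is what allows execution and training to remain fully decentralized; how the response model of the other agents is obtained (game against nature vs. dyadic Stackelberg game) is the subject of the paper itself.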