Multi-agent reinforcement learning with synchronized and decomposed reward automaton synthesized from reactive temporal logic

Impact factor 7.2 | Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence
Chenyang Zhu, Jinyu Zhu, Wen Si, Xueyuan Wang, Fang Wang
Journal: Knowledge-Based Systems, vol. 306, Article 112703
Published: 2024-11-12
DOI: 10.1016/j.knosys.2024.112703
Full text: https://www.sciencedirect.com/science/article/pii/S0950705124013376
Citations: 0

Abstract

Multi-agent systems (MAS) consist of multiple autonomous agents that interact to achieve collective objectives. Multi-agent reinforcement learning (MARL) enhances these systems by enabling agents to learn optimal behaviors through interaction, improving their coordination in dynamic environments. However, MARL struggles with tasks that depend on the history of past states and actions, which in reactive systems the current state alone does not adequately represent. This paper addresses these challenges by considering MAS that operate under task specifications formulated in Generalized Reactivity of rank 1 (GR(1)). Strategies synthesized from these specifications serve as a priori knowledge to guide learning. To handle non-Markovian tasks in reactive systems, we propose a novel synchronized decentralized training paradigm that guides agents to learn within the MARL framework using a reward structure constructed from decomposed synthesized GR(1) strategies. We first formalize the synthesis of GR(1) strategies as a reachability problem over the winning states of the system. We then develop a decomposition mechanism that constructs individual reward structures for decentralized MARL, incorporating potential values calculated through value iteration. Theoretical proofs verify that the safety and liveness properties are preserved. We evaluate our approach against other state-of-the-art methods under various GR(1) specifications and scenario maps, demonstrating superior learning efficacy and optimal rewards per episode. We also show that the decentralized training paradigm outperforms the centralized one. Finally, we compare the value iteration strategy used to calculate potential values for the reward structure against two other strategies, showcasing its advantages.
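
The abstract mentions a reward structure whose potential values come from value iteration. The following is a minimal sketch of that general idea, not the authors' implementation: it runs value iteration over a small, hypothetical reward automaton and uses the resulting values as potentials for standard potential-based reward shaping. The automaton, the event names, the unit reward on reaching the accepting state, and the discount factor 0.9 are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

State = int
Event = str
Transitions = Dict[Tuple[State, Event], State]

# Hypothetical automaton (an assumption, not taken from the paper):
# events "a", "b", "c" must be observed in order to reach accepting state 3.
TRANSITIONS: Transitions = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}
ACCEPTING: Set[State] = {3}
GAMMA = 0.9  # assumed discount factor


def step(q: State, event: Event, transitions: Transitions = TRANSITIONS) -> State:
    """Advance the automaton; events with no listed edge leave the state unchanged."""
    return transitions.get((q, event), q)


def value_iteration(
    transitions: Transitions,
    accepting: Set[State],
    gamma: float = GAMMA,
    tol: float = 1e-9,
    max_iters: int = 1000,
) -> Dict[State, float]:
    """Compute a value per automaton state, assuming reward 1 on entering an accepting state.

    Accepting states are treated as terminal with value 0. Implicit self-loops
    are omitted from the backup because, with non-negative values and
    gamma < 1, staying in place never beats the best listed edge.
    """
    states = {q for q, _ in transitions} | set(transitions.values()) | set(accepting)
    values = {q: 0.0 for q in states}
    for _ in range(max_iters):
        delta = 0.0
        for q in states:
            if q in accepting:
                continue
            backups = [
                (1.0 if nxt in accepting else 0.0) + gamma * values[nxt]
                for (src, _), nxt in transitions.items()
                if src == q
            ]
            new_value = max(backups, default=0.0)
            delta = max(delta, abs(new_value - values[q]))
            values[q] = new_value
        if delta < tol:
            break
    return values


def shaped_reward(
    potentials: Dict[State, float],
    q: State,
    q_next: State,
    base_reward: float = 0.0,
    gamma: float = GAMMA,
) -> float:
    """Potential-based shaping: base reward plus gamma * phi(q_next) - phi(q)."""
    return base_reward + gamma * potentials[q_next] - potentials[q]


if __name__ == "__main__":
    potentials = value_iteration(TRANSITIONS, ACCEPTING)
    q = 0
    for event in ["z", "a", "b", "c"]:  # "z" is an irrelevant event
        q_next = step(q, event)
        base = 1.0 if q_next in ACCEPTING and q not in ACCEPTING else 0.0
        print(event, q, q_next, shaped_reward(potentials, q, q_next, base))
        q = q_next
```

In this toy setting the potentials grow as the automaton approaches the accepting state, so an agent that stalls in the same automaton state receives a negative shaped reward while a transition that makes progress does not. In a decentralized setup each agent would hold its own automaton and potentials; how the paper decomposes and synchronizes them is not specified in the abstract.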
Source journal
Knowledge-Based Systems (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Publication volume: 1245
Review time: 7.8 months
About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.