Multi-agent reinforcement learning with synchronized and decomposed reward automaton synthesized from reactive temporal logic

IF 7.2 | CAS Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems | Pub Date: 2024-11-12 | DOI: 10.1016/j.knosys.2024.112703

Chenyang Zhu, Jinyu Zhu, Wen Si, Xueyuan Wang, Fang Wang

Knowledge-Based Systems, vol. 306, Article 112703 (journal article, not open access). Full text: https://www.sciencedirect.com/science/article/pii/S0950705124013376

Citations: 0

Abstract

Multi-agent systems (MAS) consist of multiple autonomous agents interacting to achieve collective objectives. Multi-agent reinforcement learning (MARL) enhances these systems by enabling agents to learn optimal behaviors through interaction, thus improving their coordination in dynamic environments. However, MARL struggles with tasks that depend on the history of past states and actions, because in reactive systems the current state alone does not adequately capture those dependencies. This paper addresses these challenges by considering MAS operating under task specifications formulated as Generalized Reactivity of rank 1 (GR(1)). Strategies synthesized from these GR(1) specifications are used as a priori knowledge to guide learning. To handle non-Markovian tasks in reactive systems, we propose a novel synchronized decentralized training paradigm that guides agents to learn within the MARL framework using a reward structure constructed from decomposed synthesized GR(1) strategies. We first formalize the synthesis of GR(1) strategies as a reachability problem over the winning states of the system. We then develop a decomposition mechanism that constructs individual reward structures for decentralized MARL, incorporating potential values calculated through value iteration. Theoretical proofs are provided to verify that the safety and liveness properties are preserved. We evaluate our approach against other state-of-the-art methods under various GR(1) specifications and scenario maps, demonstrating superior learning efficacy and optimal rewards per episode. Additionally, we show that the decentralized training paradigm outperforms the centralized training paradigm. The value iteration strategy used to calculate potential values for the reward structure is compared against two other strategies, showcasing its advantages.
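The abstract mentions potential values computed by value iteration and used inside a reward structure, but it gives no construction details. The following is only a minimal sketch of one common way such an idea can be realized: potential-based reward shaping over a small reward automaton. The automaton (`STATES`, `TRANSITIONS`, `GOAL`), the discount `GAMMA`, and the helper names `potentials` and `shaped_reward` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical three-state reward automaton (an assumption for illustration).
# Edges are labelled by the event observed in the environment.
STATES = [0, 1, 2]
GOAL = 2  # accepting state of the toy automaton
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "a"): 1,  # progress
    (0, "b"): 0,  # stall
    (1, "a"): 1,  # stall
    (1, "b"): 2,  # reach the goal
}
GAMMA = 0.9  # assumed discount factor


def potentials(gamma: float = GAMMA, tol: float = 1e-9) -> Dict[int, float]:
    """Value iteration on the automaton graph.

    The goal potential is fixed at 1; every other state takes the best
    discounted potential among its successors.
    """
    phi = {q: 0.0 for q in STATES}
    phi[GOAL] = 1.0
    while True:
        delta = 0.0
        for q in STATES:
            if q == GOAL:
                continue
            best = max(
                gamma * phi[nxt]
                for (src, _), nxt in TRANSITIONS.items()
                if src == q
            )
            delta = max(delta, abs(best - phi[q]))
            phi[q] = best
        if delta < tol:
            return phi


def shaped_reward(
    q: int,
    event: str,
    phi: Dict[int, float],
    gamma: float = GAMMA,
    base: float = 0.0,
) -> Tuple[int, float]:
    """Advance the automaton and add the shaping term gamma*phi(q') - phi(q)."""
    nxt = TRANSITIONS.get((q, event), q)  # unknown events leave the state unchanged
    return nxt, base + gamma * phi[nxt] - phi[q]


if __name__ == "__main__":
    phi = potentials()
    print(phi)
    print(shaped_reward(0, "b", phi))  # stalling in state 0
    print(shaped_reward(0, "a", phi))  # progressing from state 0 to state 1
```

In this toy setup, moves along the shortest path to the goal receive a shaping term of about zero, while stalling moves receive a negative one, so the shaping discourages non-progress without changing which path is best. In a decentralized setting each agent would presumably hold its own automaton and potential table, but that decomposition is not sketched here.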


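Source journal

Knowledge-Based Systems (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months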

Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.