Multi-agent reinforcement learning with synchronized and decomposed reward automaton synthesized from reactive temporal logic
Chenyang Zhu, Jinyu Zhu, Wen Si, Xueyuan Wang, Fang Wang
Knowledge-Based Systems, Volume 306, Article 112703 (published 2024-11-12). DOI: 10.1016/j.knosys.2024.112703
Citations: 0
Abstract
Multi-agent systems (MAS) consist of multiple autonomous agents interacting to achieve collective objectives. Multi-agent reinforcement learning (MARL) enhances these systems by enabling agents to learn optimal behaviors through interaction, thus improving their coordination in dynamic environments. However, MARL faces significant challenges in adapting to complex dependencies on past states and actions, which are not adequately represented by the current state alone in reactive systems. This paper addresses these challenges by considering MAS operating under task specifications formulated in Generalized Reactivity of rank 1 (GR(1)); strategies synthesized from these specifications serve as a priori knowledge to guide learning. To tackle the difficulties of handling non-Markovian tasks in reactive systems, we propose a novel synchronized decentralized training paradigm that guides agents to learn within the MARL framework using a reward structure constructed from the decomposed synthesized GR(1) strategies. We first formalize the synthesis of GR(1) strategies as a reachability problem over the winning states of the system. We then develop a decomposition mechanism that constructs individual reward structures for decentralized MARL, incorporating potential values calculated through value iteration. Theoretical proofs verify that safety and liveness properties are preserved. We evaluate our approach against other state-of-the-art methods under various GR(1) specifications and scenario maps, demonstrating superior learning efficacy and optimal rewards per episode. Additionally, the decentralized training paradigm outperforms the centralized one, and the value-iteration strategy used to compute potential values for the reward structure compares favorably against two alternative strategies.
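To make the reward-construction step more concrete, the following is a minimal sketch, not the authors' implementation: it assumes a small hypothetical reward automaton (`transitions`, `winning_states`), an assumed discount factor `GAMMA`, and standard potential-based reward shaping. Potential values are computed over the automaton states by value iteration and then used to shape a single agent's reward as its automaton state advances.

```python
# Minimal, illustrative sketch; the automaton, discount factor, and shaping
# form are assumptions, not taken from the paper.
GAMMA = 0.95  # discount factor (assumed)

# Hypothetical automaton: state -> {observed label: next state}.
# "q2" is an absorbing winning state in which the (sub-)task is satisfied.
transitions = {
    "q0": {"a": "q1", "b": "q0"},
    "q1": {"a": "q2", "b": "q0"},
    "q2": {"a": "q2", "b": "q2"},
}
winning_states = {"q2"}


def value_iteration(transitions, winning_states, gamma=GAMMA, tol=1e-6):
    """Potential value per automaton state: 1.0 at winning states, elsewhere
    the discounted best-case value of reaching a winning state
    (the value-iteration fixed point)."""
    values = {q: (1.0 if q in winning_states else 0.0) for q in transitions}
    while True:
        delta = 0.0
        for q in transitions:
            if q in winning_states:
                continue
            new_v = gamma * max(values[nq] for nq in transitions[q].values())
            delta = max(delta, abs(new_v - values[q]))
            values[q] = new_v
        if delta < tol:
            return values


def shaped_reward(base_reward, q, q_next, potentials, gamma=GAMMA):
    """Standard potential-based shaping: r' = r + gamma * phi(q') - phi(q)."""
    return base_reward + gamma * potentials[q_next] - potentials[q]


if __name__ == "__main__":
    potentials = value_iteration(transitions, winning_states)
    print("potentials:", potentials)
    # Progress along the optimal path is shaping-neutral; regressing is penalized.
    print("q1 -> q2:", shaped_reward(0.0, "q1", "q2", potentials))  # ~0.0
    print("q1 -> q0:", shaped_reward(0.0, "q1", "q0", potentials))  # negative
```

With potentials equal to the discounted best-case reachability value, transitions along an optimal path toward a winning state are shaping-neutral while transitions that move away from it are penalized, which is the usual rationale for using such potentials to guide learning without altering the optimal policy.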
Journal Introduction:
Knowledge-Based Systems is an international, interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical studies, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.