Model-Free Learning and Optimal Policy Design in Multiagent MDPs Under Probabilistic Agent Dropout
Authors: Carmel Fiscko; Soummya Kar; Bruno Sinopoli
Journal: IEEE Transactions on Control of Network Systems, vol. 12, no. 1, pp. 361-373
Publication date: 2024-09-25
DOI: 10.1109/TCNS.2024.3469031
URL: https://ieeexplore.ieee.org/document/10694782/
This work studies a multiagent Markov decision process (MDP) that can undergo agent dropout and the computation of policies for the postdropout system based on control and sampling of the predropout system. The central planner's objective is to find an optimal policy that maximizes the value of the expected system given a priori knowledge of the agents' dropout probabilities. For MDPs with a certain transition independence and reward separability structure, we assume that removing agents from the system forms a new MDP comprised of the remaining agents with new state and action spaces, transition dynamics that marginalize the removed agents, and rewards that are independent of the removed agents. We first show that under these assumptions, the value of the expected postdropout system can be represented by a single MDP; this “robust MDP” eliminates the need to evaluate all $2^{N}$ realizations of the system, where $N$ denotes the number of agents. More significantly, in a model-free context, it is shown that the robust MDP value can be estimated with samples generated by the predropout system, meaning that robust policies can be found before dropout occurs. This fact is used to propose a policy importance sampling (IS) routine that performs policy evaluation for dropout scenarios while controlling the existing system with good predropout policies. The policy IS routine produces value estimates for both the robust MDP and specific postdropout system realizations and is justified with exponential confidence bounds. Finally, the utility of this approach is verified in simulation, showing how structural properties of agent dropout can help a controller find good postdropout policies before dropout occurs.
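The claim that a single MDP can stand in for all $2^{N}$ dropout realizations can be illustrated on the reward term alone. The sketch below is not the paper's construction; it assumes that agents drop out independently with known probabilities and that the system reward is a plain sum of per-agent rewards (one possible reading of "reward separability"). Under those assumptions, linearity of expectation collapses the average over every realization into one weighted sum. The function names and the numeric values are hypothetical.

```python
from itertools import product
import math


def expected_reward_by_enumeration(rewards, dropout_probs):
    """Average the total reward over all 2^N dropout realizations.

    Assumption (not from the source): agent n drops out independently
    with probability dropout_probs[n], and the system reward is the sum
    of the rewards of the agents that remain.
    """
    n = len(rewards)
    total = 0.0
    for present in product([0, 1], repeat=n):
        # Probability of this particular realization under independence.
        prob = math.prod(
            (1.0 - p) if keep else p
            for keep, p in zip(present, dropout_probs)
        )
        # Only the remaining agents contribute reward.
        total += prob * sum(r for keep, r in zip(present, rewards) if keep)
    return total


def expected_reward_closed_form(rewards, dropout_probs):
    """Same expectation in O(N): weight each reward by its survival probability."""
    return sum((1.0 - p) * r for r, p in zip(rewards, dropout_probs))


# Hypothetical per-agent rewards and dropout probabilities.
rewards = [1.0, 2.0, 0.5, 3.0]
dropout_probs = [0.1, 0.5, 0.3, 0.9]

brute = expected_reward_by_enumeration(rewards, dropout_probs)
closed = expected_reward_closed_form(rewards, dropout_probs)
print(math.isclose(brute, closed))
```

The enumeration costs $2^{N}$ terms while the closed form costs $N$. The sketch covers only a one-step reward; the abstract's result concerns the value of the full MDP, including transition dynamics that marginalize the removed agents, which this example does not model.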
About the journal:
The IEEE Transactions on Control of Network Systems is committed to the timely publication of high-impact papers at the intersection of control systems and network science. In particular, the journal addresses research on the analysis, design, and implementation of networked control systems, as well as control over networks. Relevant work includes the full spectrum from basic research on control systems to the design of engineering solutions for automatic control of, and over, networks. The topics covered by this journal include: coordinated control and estimation over networks; control and computation over sensor networks; control under communication constraints; control and performance analysis issues that arise in the dynamics of networks used in application areas such as communications, computers, transportation, manufacturing, Web ranking and aggregation, social networks, biology, power systems, and economics; synchronization of activities across a controlled network; stability analysis of controlled networks; and analysis of networks as hybrid dynamical systems.