Model-Free Learning and Optimal Policy Design in Multiagent MDPs Under Probabilistic Agent Dropout

IF 5; Zone 3 (Computer Science); Q2 (AUTOMATION & CONTROL SYSTEMS)

IEEE Transactions on Control of Network Systems; Pub Date: 2024-09-25; DOI: 10.1109/TCNS.2024.3469031

Carmel Fiscko;Soummya Kar;Bruno Sinopoli

Volume 12, Issue 1, pp. 361-373; https://ieeexplore.ieee.org/document/10694782/

Citations: 0

Abstract



Model-Free Learning and Optimal Policy Design in Multiagent MDPs Under Probabilistic Agent Dropout

This work studies a multiagent Markov decision process (MDP) that can undergo agent dropout and the computation of policies for the postdropout system based on control and sampling of the predropout system. The central planner's objective is to find an optimal policy that maximizes the value of the expected system given a priori knowledge of the agents' dropout probabilities. For MDPs with a certain transition independence and reward separability structure, we assume that removing agents from the system forms a new MDP comprised of the remaining agents with new state and action spaces, transition dynamics that marginalize the removed agents, and rewards that are independent of the removed agents. We first show that under these assumptions, the value of the expected postdropout system can be represented by a single MDP; this “robust MDP” eliminates the need to evaluate all $2^{N}$ realizations of the system, where $N$ denotes the number of agents. More significantly, in a model-free context, it is shown that the robust MDP value can be estimated with samples generated by the predropout system, meaning that robust policies can be found before dropout occurs. This fact is used to propose a policy importance sampling (IS) routine that performs policy evaluation for dropout scenarios while controlling the existing system with good predropout policies. The policy IS routine produces value estimates for both the robust MDP and specific postdropout system realizations and is justified with exponential confidence bounds. Finally, the utility of this approach is verified in simulation, showing how structural properties of agent dropout can help a controller find good postdropout policies before dropout occurs.
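To give a feel for why reward separability helps avoid the $2^{N}$ enumeration, the following is a minimal sketch on rewards only. It is not the paper's robust MDP construction, which also covers transition dynamics and policies. It assumes that agents drop out independently and that the system reward is a plain sum of per-agent rewards; the function names and the numbers are hypothetical, chosen for illustration.

```python
from itertools import product
from math import prod, isclose


def expected_reward_bruteforce(rewards, dropout_probs):
    """Average a separable reward over all 2^N dropout realizations."""
    n = len(rewards)
    total = 0.0
    # keep[i] == 1 means agent i stays in the system; 0 means it dropped out.
    for keep in product((0, 1), repeat=n):
        # Probability of this realization, ASSUMING independent dropout.
        prob = prod((1 - p) if k else p for k, p in zip(keep, dropout_probs))
        # Separable reward (assumption): only the remaining agents contribute.
        reward = sum(r for k, r in zip(keep, rewards) if k)
        total += prob * reward
    return total


def expected_reward_closed_form(rewards, dropout_probs):
    """Same expectation by linearity: weight each agent by its survival probability."""
    return sum((1 - p) * r for r, p in zip(rewards, dropout_probs))


if __name__ == "__main__":
    rewards = [1.0, 2.0, 4.0]          # hypothetical per-agent rewards
    dropout_probs = [0.1, 0.5, 0.25]   # hypothetical dropout probabilities
    brute = expected_reward_bruteforce(rewards, dropout_probs)
    closed = expected_reward_closed_form(rewards, dropout_probs)
    print(brute, closed, isclose(brute, closed))
```

The brute-force loop visits every one of the $2^{N}$ realizations, while the closed form needs only $N$ terms. Under the stated assumptions the two agree by linearity of expectation, which is the kind of structure a single "expected" model can exploit.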


Source journal

IEEE Transactions on Control of Network Systems (Mathematics: Control and Optimization)

CiteScore: 7.80

Self-citation rate: 7.10%

Articles published: 169

About the journal: The IEEE Transactions on Control of Network Systems is committed to the timely publication of high-impact papers at the intersection of control systems and network science. In particular, the journal addresses research on the analysis, design, and implementation of networked control systems, as well as control over networks. Relevant work includes the full spectrum from basic research on control systems to the design of engineering solutions for automatic control of, and over, networks. The topics covered by this journal include: coordinated control and estimation over networks; control and computation over sensor networks; control under communication constraints; control and performance analysis issues that arise in the dynamics of networks used in application areas such as communications, computers, transportation, manufacturing, Web ranking and aggregation, social networks, biology, power systems, and economics; synchronization of activities across a controlled network; stability analysis of controlled networks; and analysis of networks as hybrid dynamical systems.