{"title":"多智能体系统区域覆盖控制的时间差分学习","authors":"Farzan Soleymani, Md. Suruz Miah, D. Spinello","doi":"10.1109/ROSE56499.2022.9977412","DOIUrl":null,"url":null,"abstract":"We formulate an area coverage control problem with multi-agent systems by using Bellman's principle of optimality. The performance index is composed of the additive contributions of a term quadratic in the control effort and of a positive definite term that depends on the coverage metric. In this way, the reward encodes optimality in the sense of the classical Lloyd's algorithm, where the term depending on the coverage metric weights the energy of the state, and the term depending on the control weights the effort energy. Quasi optimality is achieved by an adaptive control policy using an actor-critic neural networks based reinforcement learning strategy, with quadratic function approximations for the value function and the control policy. Optimal configurations for a team of agents correspond to centroidal Voronoi partitions of the workspace, with each agent converging to the centroid of the respective generalized Voronoi cell. The system's dynamics is written in discrete time form, and the temporal difference form of Bellman's equation is used in the policy iteration learning scheme to train critic and actor weights. Remarkably, the obtained class of solutions is consistent with the one obtained with Lloyd's algorithm, with the advantage that the reinforcement learning formulation allows for a model-free implementation based on data measured along a system's trajectory. By storing an appropriate time history of control actions, the gradient of the value function is numerically approximated, allowing one to run the policy approximation without knowledge of the input dynamics. Direct comparisons with Lloyd's algorithm show the expected slower convergence since Lloyd's trajectories are optimal for the continuous time system.","PeriodicalId":265529,"journal":{"name":"2022 IEEE International Symposium on Robotic and Sensors Environments (ROSE)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Temporal Difference Learning of Area Coverage Control with Multi-Agent Systems\",\"authors\":\"Farzan Soleymani, Md. Suruz Miah, D. Spinello\",\"doi\":\"10.1109/ROSE56499.2022.9977412\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We formulate an area coverage control problem with multi-agent systems by using Bellman's principle of optimality. The performance index is composed of the additive contributions of a term quadratic in the control effort and of a positive definite term that depends on the coverage metric. In this way, the reward encodes optimality in the sense of the classical Lloyd's algorithm, where the term depending on the coverage metric weights the energy of the state, and the term depending on the control weights the effort energy. Quasi optimality is achieved by an adaptive control policy using an actor-critic neural networks based reinforcement learning strategy, with quadratic function approximations for the value function and the control policy. Optimal configurations for a team of agents correspond to centroidal Voronoi partitions of the workspace, with each agent converging to the centroid of the respective generalized Voronoi cell. 
The system's dynamics is written in discrete time form, and the temporal difference form of Bellman's equation is used in the policy iteration learning scheme to train critic and actor weights. Remarkably, the obtained class of solutions is consistent with the one obtained with Lloyd's algorithm, with the advantage that the reinforcement learning formulation allows for a model-free implementation based on data measured along a system's trajectory. By storing an appropriate time history of control actions, the gradient of the value function is numerically approximated, allowing one to run the policy approximation without knowledge of the input dynamics. Direct comparisons with Lloyd's algorithm show the expected slower convergence since Lloyd's trajectories are optimal for the continuous time system.\",\"PeriodicalId\":265529,\"journal\":{\"name\":\"2022 IEEE International Symposium on Robotic and Sensors Environments (ROSE)\",\"volume\":\"117 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Symposium on Robotic and Sensors Environments (ROSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ROSE56499.2022.9977412\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Symposium on Robotic and Sensors Environments (ROSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ROSE56499.2022.9977412","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Temporal Difference Learning of Area Coverage Control with Multi-Agent Systems
We formulate an area coverage control problem with multi-agent systems by using Bellman's principle of optimality. The performance index is composed of the additive contributions of a term quadratic in the control effort and a positive definite term that depends on the coverage metric. In this way, the reward encodes optimality in the sense of the classical Lloyd's algorithm: the term depending on the coverage metric weights the state energy, while the term depending on the control weights the effort energy. Quasi-optimality is achieved by an adaptive control policy using an actor-critic, neural-network-based reinforcement learning strategy, with quadratic function approximations for the value function and the control policy. Optimal configurations for a team of agents correspond to centroidal Voronoi partitions of the workspace, with each agent converging to the centroid of its generalized Voronoi cell. The system dynamics are written in discrete-time form, and the temporal-difference form of Bellman's equation is used in the policy-iteration learning scheme to train the critic and actor weights. Remarkably, the obtained class of solutions is consistent with the one obtained with Lloyd's algorithm, with the advantage that the reinforcement learning formulation allows for a model-free implementation based on data measured along the system's trajectory. By storing an appropriate time history of control actions, the gradient of the value function is numerically approximated, allowing the policy approximation to run without knowledge of the input dynamics. Direct comparisons with Lloyd's algorithm show the expected slower convergence, since Lloyd's trajectories are optimal for the continuous-time system.
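For context, the quantities the abstract refers to can be sketched with the standard locational-optimization and temporal-difference forms below; the symbols (V_i, phi, Q, R, V-hat) are conventional placeholders from that literature and are not taken from the paper's own notation.

```latex
% Hedged reconstruction using standard forms; not the paper's exact notation.
% Coverage metric over generalized Voronoi cells V_i with density \phi:
\[
  \mathcal{H}(p_1,\dots,p_n) = \sum_{i=1}^{n} \int_{V_i} \lVert q - p_i \rVert^{2}\,\phi(q)\,\mathrm{d}q .
\]
% Additive performance index: a positive definite coverage-dependent term
% plus a term quadratic in the control effort (R \succ 0):
\[
  J(x_k) = \sum_{j=k}^{\infty} \Big( Q\big(\mathcal{H}(x_j)\big) + u_j^{\top} R\, u_j \Big).
\]
% Temporal-difference form of Bellman's equation used to train the critic
% \hat{V}: the critic weights are adjusted to drive the TD error e_k to zero.
\[
  e_k = Q\big(\mathcal{H}(x_k)\big) + u_k^{\top} R\, u_k + \hat{V}(x_{k+1}) - \hat{V}(x_k).
\]
```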
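A minimal sketch of the described policy-iteration structure follows, assuming a scalar single-integrator agent, a discounted quadratic stand-in for the coverage-dependent term, and an analytic policy-improvement step in place of the paper's numerical approximation of the value-function gradient; all names and constants (Qc, R, dt, gamma, simulate, critic_update) are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

# Sketch only: discrete-time single integrator x_{k+1} = x_k + dt*u_k with
# stage cost r_k = Qc*x_k^2 + R*u_k^2 (the quadratic state term stands in
# for the coverage-metric-dependent term in the abstract).
dt, Qc, R = 0.1, 1.0, 0.1
gamma = 0.95  # discount factor; an assumption of this sketch

def features(x):
    # Quadratic value-function approximation: V(x) ~ w * x^2
    return np.array([x * x])

def simulate(policy_gain, x0=2.0, steps=200):
    """Roll out the current linear policy u = -K*x and store the time
    history of states, controls, and stage costs (the stored controls are
    what a model-free scheme would reuse)."""
    xs, us, rs = [x0], [], []
    x = x0
    for _ in range(steps):
        u = -policy_gain * x
        r = Qc * x * x + R * u * u
        x = x + dt * u
        us.append(u); rs.append(r); xs.append(x)
    return np.array(xs), np.array(us), np.array(rs)

def critic_update(xs, rs):
    """Least-squares fit of the critic weight from the TD form of
    Bellman's equation: w*phi(x_k) ~ r_k + gamma*w*phi(x_{k+1})."""
    Phi  = np.array([features(x) for x in xs[:-1]])
    Phi1 = np.array([features(x) for x in xs[1:]])
    A = Phi - gamma * Phi1
    w, *_ = np.linalg.lstsq(A, rs, rcond=None)
    return w

K = 0.5  # initial stabilizing policy gain
for it in range(20):  # policy iteration: evaluate, then improve
    xs, us, rs = simulate(K)
    w = critic_update(xs, rs)
    # Greedy improvement, done analytically for this quadratic sketch;
    # the paper instead approximates the value gradient numerically from
    # the stored control history, avoiding knowledge of the input dynamics.
    K = gamma * w[0] * dt / (R + gamma * w[0] * dt * dt)
    print(f"iter {it:2d}  critic weight {w[0]:.4f}  policy gain {K:.4f}")
```

In this toy setting the critic weight and policy gain settle after a few iterations, mirroring the evaluate-then-improve cycle the abstract describes; the multi-agent coverage case replaces the scalar state cost with the Voronoi-based coverage metric.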