Temporal Difference Learning of Area Coverage Control with Multi-Agent Systems

2022 IEEE International Symposium on Robotic and Sensors Environments (ROSE). Pub Date: 2022-11-14. DOI: 10.1109/ROSE56499.2022.9977412

Farzan Soleymani, Md. Suruz Miah, D. Spinello


Citations: 2

Abstract

We formulate an area coverage control problem for multi-agent systems using Bellman's principle of optimality. The performance index is the sum of a term quadratic in the control effort and a positive definite term that depends on the coverage metric. The reward thus encodes optimality in the sense of the classical Lloyd's algorithm: the coverage-dependent term weights the energy of the state, and the control-dependent term weights the effort energy. Quasi-optimality is achieved by an adaptive control policy based on an actor-critic neural-network reinforcement learning strategy, with quadratic function approximations for the value function and the control policy. Optimal configurations for a team of agents correspond to centroidal Voronoi partitions of the workspace, with each agent converging to the centroid of its generalized Voronoi cell. The system dynamics are written in discrete-time form, and the temporal difference form of Bellman's equation is used in the policy iteration learning scheme to train the critic and actor weights. The resulting class of solutions is consistent with the one obtained with Lloyd's algorithm, with the advantage that the reinforcement learning formulation allows a model-free implementation based on data measured along a system trajectory. By storing an appropriate time history of control actions, the gradient of the value function is approximated numerically, so the policy approximation can run without knowledge of the input dynamics. Direct comparisons with Lloyd's algorithm show the expected slower convergence, since Lloyd's trajectories are optimal for the continuous-time system.
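For readers unfamiliar with the Lloyd baseline the abstract compares against, the following is a minimal sketch of a discretized Lloyd iteration, not the authors' implementation. It assumes a unit-square workspace sampled on a regular grid, a uniform density, and single-integrator agents that move a fraction `gain` of the way to the centroid of their Voronoi cell at each step. The function names (`lloyd_step`, `coverage_cost`, `run_lloyd`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def coverage_cost(agents, samples):
    """Mean squared distance from each sample point to its nearest agent."""
    d2 = ((samples[:, None, :] - agents[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()


def lloyd_step(agents, samples, gain=1.0):
    """One Lloyd update: move each agent toward the centroid of its cell.

    Assumption: uniform density, so the centroid of a cell is the plain
    average of the sample points assigned to that agent.
    """
    # Squared distances, shape (num_samples, num_agents).
    d2 = ((samples[:, None, :] - agents[None, :, :]) ** 2).sum(axis=2)
    owner = d2.argmin(axis=1)  # index of the nearest agent for each sample
    new_agents = agents.copy()
    for i in range(len(agents)):
        cell = samples[owner == i]
        if len(cell) > 0:  # an agent with an empty cell stays in place
            centroid = cell.mean(axis=0)
            new_agents[i] = agents[i] + gain * (centroid - agents[i])
    return new_agents


def run_lloyd(n_agents=4, n_grid=40, steps=50, seed=0):
    """Run the iteration from a clustered start and record the cost."""
    rng = np.random.default_rng(seed)
    xs = (np.arange(n_grid) + 0.5) / n_grid  # cell-centered grid on [0, 1]
    gx, gy = np.meshgrid(xs, xs)
    samples = np.column_stack([gx.ravel(), gy.ravel()])
    # Hypothetical initial condition: all agents start near one corner.
    agents = rng.uniform(0.0, 0.3, size=(n_agents, 2))
    costs = [coverage_cost(agents, samples)]
    for _ in range(steps):
        agents = lloyd_step(agents, samples)
        costs.append(coverage_cost(agents, samples))
    return agents, costs


if __name__ == "__main__":
    final_agents, cost_history = run_lloyd()
    print("initial cost:", cost_history[0])
    print("final cost:  ", cost_history[-1])
    print("final agent positions:\n", final_agents)
```

On a sampled workspace this update coincides with a k-means style iteration, so the coverage cost does not increase from one step to the next. The reinforcement learning scheme described in the abstract targets the same class of centroidal configurations, but learns the policy from measured trajectory data instead of computing centroids directly.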

