Opponent modeling with trajectory representation clustering

Intelligence & Robotics. Pub Date: 1900-01-01. DOI: 10.20517/ir.2022.09

Yongliang Lv, Yan Zheng, Jianye Hao


Abstract

For a non-stationary opponent in a multi-agent environment, traditional methods model the opponent from its complex information in order to learn one or more optimal response policies. However, when the opponent's policy changes non-stationarily, the online-updated replay buffer becomes imbalanced, and response policies learned earlier are prone to catastrophic forgetting. This paper focuses on how to learn new response policies without forgetting previously learned ones while the opponent policy is constantly changing. We extract representations of opponent policies and cluster them explicitly with a contrastive learning autoencoder. By balancing the replay buffer, we keep learning from the trajectory data of every opponent policy that has appeared so far, which avoids policy forgetting. Finally, we demonstrate the effectiveness of the method in a classical opponent modeling environment (soccer) and show the clustering effect for different opponent policies.
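The abstract does not say how the replay buffer is balanced, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each transition already carries a cluster label for the opponent policy that produced it (for example, from clustering the learned trajectory representations). The names `BalancedReplayBuffer`, `capacity_per_cluster`, `add`, and `sample` are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict, deque


class BalancedReplayBuffer:
    """Hypothetical sketch: one bounded FIFO buffer per opponent cluster.

    Batches are drawn evenly across clusters, so a rarely seen opponent
    policy is not crowded out by the most recent one.
    """

    def __init__(self, capacity_per_cluster, seed=None):
        self.capacity_per_cluster = capacity_per_cluster
        # Each cluster gets its own deque; old transitions fall off the left.
        self.buffers = defaultdict(lambda: deque(maxlen=capacity_per_cluster))
        self.rng = random.Random(seed)

    def add(self, cluster_id, transition):
        """Store a transition under the cluster label assumed to be given."""
        self.buffers[cluster_id].append(transition)

    def sample(self, batch_size):
        """Return a batch with (almost) equal counts from every non-empty cluster."""
        clusters = [c for c, buf in self.buffers.items() if buf]
        if not clusters:
            return []
        base, extra = divmod(batch_size, len(clusters))
        # Give the remainder to randomly chosen clusters so none is favoured.
        bonus = set(self.rng.sample(clusters, extra))
        batch = []
        for c in clusters:
            n = base + (1 if c in bonus else 0)
            # Sample with replacement so small clusters can still fill their share.
            batch.extend(self.rng.choices(self.buffers[c], k=n))
        return batch


# Usage: cluster 0 has far more data than cluster 1, yet both fill half the batch.
buffer = BalancedReplayBuffer(capacity_per_cluster=1000, seed=0)
for i in range(1000):
    buffer.add(0, (0, i))
for i in range(10):
    buffer.add(1, (1, i))
batch = buffer.sample(64)
```

Sampling with replacement inside each cluster is a design choice of this sketch: it keeps the per-cluster share fixed even when a cluster holds only a handful of transitions, at the cost of repeating those transitions within a batch.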
