{"title":"基于足球比赛的多智能体强化学习研究","authors":"Danping Wu","doi":"10.1109/CCAI57533.2023.10201281","DOIUrl":null,"url":null,"abstract":"Games are classic scenarios for reinforcement learning, and the support of a variety of standard tasks and experimental platforms is one of the reasons for the success of reinforcement learning. In the actual environment, interacting with other individuals is often necessary to be considered, such as in games Go, Poker, football and tennis, which require multi-agent participation. This paper implemented some multi-agent investigation based on Google Research Football to show the performance gap between the centralized trained policy, which produces one action when called; and a unified “super agent”, which produces multiple actions simultaneously. Then it compared the sensitivity of the learned policies by perturbing the policy parameters / observations / initial states / cooperative agent. In this case, Proximal Policy Optimization (PPO) method has faster convergence than IMPALA, and using mean provides higher episode mean reward than using the max operator for aggregation. Besides, a multi-player experiment was also tried out, the results show that the less the episode difference is, the greater the expected return it gets.","PeriodicalId":285760,"journal":{"name":"2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Agent Reinforcement Learning Investigation Based on Football Games\",\"authors\":\"Danping Wu\",\"doi\":\"10.1109/CCAI57533.2023.10201281\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Games are classic scenarios for reinforcement learning, and the support of a variety of standard tasks and experimental platforms is one of the reasons for the success of reinforcement learning. In the actual environment, interacting with other individuals is often necessary to be considered, such as in games Go, Poker, football and tennis, which require multi-agent participation. This paper implemented some multi-agent investigation based on Google Research Football to show the performance gap between the centralized trained policy, which produces one action when called; and a unified “super agent”, which produces multiple actions simultaneously. Then it compared the sensitivity of the learned policies by perturbing the policy parameters / observations / initial states / cooperative agent. In this case, Proximal Policy Optimization (PPO) method has faster convergence than IMPALA, and using mean provides higher episode mean reward than using the max operator for aggregation. 
Besides, a multi-player experiment was also tried out, the results show that the less the episode difference is, the greater the expected return it gets.\",\"PeriodicalId\":285760,\"journal\":{\"name\":\"2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI)\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCAI57533.2023.10201281\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCAI57533.2023.10201281","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Games are classic scenarios for reinforcement learning, and the availability of a variety of standard tasks and experimental platforms is one of the reasons for its success. In real environments, interaction with other individuals often has to be considered: games such as Go, poker, football, and tennis all require multi-agent participation. This paper carries out a multi-agent investigation on Google Research Football to show the performance gap between a centrally trained policy, which produces one action per call, and a unified “super agent”, which produces multiple actions simultaneously. It then compares the sensitivity of the learned policies by perturbing the policy parameters, the observations, the initial states, and the cooperating agent. In these experiments, Proximal Policy Optimization (PPO) converges faster than IMPALA, and aggregating with the mean yields a higher mean episode reward than aggregating with the max operator. A multi-player experiment was also carried out; the results show that the smaller the difference between episodes, the greater the expected return.
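The experiments are built on Google Research Football, where several players can be handed to one learner at once. Below is a minimal sketch of that setup, assuming the open-source `gfootball` package; the scenario name, representation, and number of controlled players are illustrative choices, not necessarily those used in the paper.

```python
# Minimal multi-agent Google Research Football setup (a sketch, assuming the
# open-source `gfootball` package; scenario and representation are illustrative).
import gfootball.env as football_env

env = football_env.create_environment(
    env_name='academy_3_vs_1_with_keeper',    # a standard academy scenario
    representation='simple115v2',             # 115-float vector per player
    number_of_left_players_agent_controls=3,  # one "super agent" drives 3 players
)

obs = env.reset()                     # one observation row per controlled player
actions = env.action_space.sample()  # one discrete action per controlled player
obs, reward, done, info = env.step(actions)
```

With `number_of_left_players_agent_controls > 1`, observations, actions, and rewards are batched per controlled player, which is the interface a unified “super agent” needs; a centrally trained single-player policy would instead be called once per player.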
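The mean-versus-max comparison concerns how per-agent signals are collapsed into one scalar. An illustrative sketch of the two operators (not the paper's code; names and values are hypothetical):

```python
# Illustrative sketch of the two aggregation operators compared in the paper:
# collapsing per-agent returns into one scalar with either mean or max.
import numpy as np

def aggregate(per_agent_returns: np.ndarray, op: str = 'mean') -> float:
    """Collapse per-agent episode returns into a single scalar signal."""
    if op == 'mean':
        return float(np.mean(per_agent_returns))
    if op == 'max':
        return float(np.max(per_agent_returns))
    raise ValueError(f'unknown aggregation op: {op}')

returns = np.array([1.0, 0.2, 0.7])  # hypothetical per-agent returns
print(aggregate(returns, 'mean'))    # 0.633...: reflects every agent
print(aggregate(returns, 'max'))     # 1.0: ignores all but the best agent
```

Under the max operator one strong agent can mask the others, which is one plausible reason the mean aggregation scores a higher mean episode reward here.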
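The sensitivity study perturbs a trained policy and measures how the return degrades. A hedged sketch of the parameter-perturbation case, where `evaluate` is a hypothetical function that rolls out a policy with the given parameters and returns its episode reward:

```python
# Sketch of a sensitivity probe: add Gaussian noise of scale sigma to the policy
# parameters and record the resulting mean episode return. `evaluate` is a
# hypothetical rollout function, not part of the paper's released code.
import numpy as np

def perturb_params(params, sigma, rng):
    """Return a copy of a dict of weight arrays with i.i.d. Gaussian noise added."""
    return {k: v + rng.normal(0.0, sigma, size=v.shape) for k, v in params.items()}

def sensitivity_curve(evaluate, params, sigmas, trials=10, seed=0):
    """Mean episode return as a function of the perturbation scale sigma."""
    rng = np.random.default_rng(seed)
    return {
        sigma: float(np.mean([evaluate(perturb_params(params, sigma, rng))
                              for _ in range(trials)]))
        for sigma in sigmas
    }
```

Perturbing observations or initial states follows the same pattern: inject the noise at that point of the rollout instead of into the weights, then compare the resulting return curves.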