{"title":"More Human-Like Gameplay by Blending Policies From Supervised and Reinforcement Learning","authors":"Tatsuyoshi Ogawa;Chu-Hsuan Hsueh;Kokolo Ikeda","doi":"10.1109/TG.2024.3424668","DOIUrl":null,"url":null,"abstract":"Modeling human players' behaviors in games is a key challenge for making natural computer players, evaluating games, and generating content. To achieve better human–computer interaction, researchers have tried various methods to create human-like artificial intelligence. In chess and \n<italic>Go</i>\n, supervised learning with deep neural networks is known as one of the most effective ways to predict human moves. However, for many other games (e.g., \n<italic>Shogi</i>\n), it is hard to collect a similar amount of game records, resulting in poor move-matching accuracy of the supervised learning. We propose a method to compensate for the weakness of the supervised learning policy by Blending it with an AlphaZero-like reinforcement learning policy. Experiments on \n<italic>Shogi</i>\n showed that the Blend method significantly improved the move-matching accuracy over supervised learning models. Experiments on chess and \n<italic>Go</i>\n with a limited number of game records also showed similar results. The Blend method was effective with both medium and large numbers of games, particularly the medium case. We confirmed the robustness of the Blend model to the parameter and discussed the mechanism why the move-matching accuracy improves. In addition, we showed that the Blend model performed better than existing work that tried to improve the move-matching accuracy.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"16 4","pages":"831-843"},"PeriodicalIF":1.7000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10595450","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10595450/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citation count: 0
Abstract
Modeling human players' behaviors in games is a key challenge for making natural computer players, evaluating games, and generating content. To achieve better human–computer interaction, researchers have tried various methods to create human-like artificial intelligence. In chess and Go, supervised learning with deep neural networks is known as one of the most effective ways to predict human moves. However, for many other games (e.g., Shogi), it is hard to collect a comparable amount of game records, resulting in poor move-matching accuracy for supervised learning. We propose a method that compensates for this weakness of the supervised learning policy by blending it with an AlphaZero-like reinforcement learning policy. Experiments on Shogi showed that the Blend method significantly improved the move-matching accuracy over supervised learning models. Experiments on chess and Go with a limited number of game records showed similar results. The Blend method was effective with both medium and large numbers of game records, and particularly so in the medium case. We confirmed the robustness of the Blend model to its blending parameter and discussed the mechanism by which the move-matching accuracy improves. In addition, we showed that the Blend model performed better than existing work that aimed to improve the move-matching accuracy.
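As a rough illustration of the idea described in the abstract, the sketch below assumes that blending two policies can be approximated by a per-move linear interpolation of their probability distributions over the same set of legal moves. The function name blend_policies, the weight alpha, and the example probabilities are illustrative assumptions; the paper's exact blending formula is not given in the abstract.

```python
import numpy as np

def blend_policies(p_sl, p_rl, alpha=0.5):
    """Blend a supervised-learning (SL) policy with a reinforcement-learning (RL) policy.

    Hypothetical sketch: assumes the blend is a linear interpolation of the two
    policies' move-probability distributions, controlled by a weight alpha.
    p_sl, p_rl: arrays of probabilities over the same ordered set of legal moves.
    """
    blended = (1.0 - alpha) * np.asarray(p_sl) + alpha * np.asarray(p_rl)
    return blended / blended.sum()  # renormalize for numerical safety

# Example: predict the move a human is most likely to play under the blended policy.
p_sl = [0.6, 0.3, 0.1]   # SL policy trained on (scarce) human game records
p_rl = [0.2, 0.5, 0.3]   # AlphaZero-like RL policy for the same position
predicted_move = int(np.argmax(blend_policies(p_sl, p_rl, alpha=0.4)))
```

In this reading, alpha would control how much the RL policy compensates for the SL policy when human game records are scarce; the abstract reports that the resulting model's move-matching accuracy is robust to the choice of this parameter.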