Bandits all the way down: UCB1 as a simulation policy in Monte Carlo Tree Search

E. Powley, D. Whitehouse, P. Cowling
Published in: 2013 IEEE Conference on Computational Intelligence in Games (CIG), 2013-10-17. DOI: 10.1109/CIG.2013.6633613 (https://doi.org/10.1109/CIG.2013.6633613)
Citations: 25

Abstract

Monte Carlo Tree Search (MCTS) is a family of asymmetric anytime aheuristic game tree search algorithms which have advanced the state of the art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move that is independent of context. We also introduce a generalisation of MAST, called N-gram-Average-Sampling-Technique (NAST), which uses as context a fixed-length sequence (or N-tuple) of recent moves. We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy for the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord Of The Rings: The Confrontation, each of which has its own unique challenges for search-based AI.
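
The idea summarised in the abstract, keeping move statistics keyed by a short sequence of recent moves and choosing among legal moves with UCB1, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class name `NGramUCB1Policy`, the exploration constant `c = 0.7`, the `mean + c * sqrt(ln N / n_i)` form of the bound, the handling of short contexts at the start of a playout, and the use of a single reward for every move in the playout are all assumptions made for illustration. Setting `n = 1` drops the context entirely, which corresponds to the context-free, MAST-like case.

```python
import math


class NGramUCB1Policy:
    """Hypothetical sketch: UCB1 over statistics keyed by (recent moves, move)."""

    def __init__(self, n=2, c=0.7):
        self.n = n        # N-gram length; n=1 means no context (MAST-like)
        self.c = c        # exploration constant (assumed value)
        self.visits = {}  # (context, move) -> number of updates
        self.total = {}   # (context, move) -> summed reward

    def _context(self, history):
        # The context is the last n-1 moves; near the start of a playout it
        # is simply shorter (an assumption of this sketch).
        k = self.n - 1
        return tuple(history[-k:]) if k > 0 else ()

    def select(self, history, legal_moves):
        ctx = self._context(history)
        counts = [self.visits.get((ctx, m), 0) for m in legal_moves]
        # Try every move once in this context before applying the bound.
        for move, n_i in zip(legal_moves, counts):
            if n_i == 0:
                return move
        log_total = math.log(sum(counts))

        def ucb(pair):
            move, n_i = pair
            mean = self.total[(ctx, move)] / n_i
            return mean + self.c * math.sqrt(log_total / n_i)

        return max(zip(legal_moves, counts), key=ucb)[0]

    def update(self, moves, reward):
        # Credit every N-gram that occurred in the playout with the same
        # reward (a single-perspective simplification).
        for i, move in enumerate(moves):
            key = (self._context(moves[:i]), move)
            self.visits[key] = self.visits.get(key, 0) + 1
            self.total[key] = self.total.get(key, 0.0) + reward


policy = NGramUCB1Policy(n=2)
first = policy.select([], ["a", "b"])   # an untried move is chosen first
policy.update([first], 1.0)
```

In a multi-player game each move would presumably be credited with the reward from the perspective of the player who made it; the sketch omits this to stay short.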