{"title":"强盗一路下来:UCB1作为蒙特卡洛树搜索的模拟策略","authors":"E. Powley, D. Whitehouse, P. Cowling","doi":"10.1109/CIG.2013.6633613","DOIUrl":null,"url":null,"abstract":"Monte Carlo Tree Search (MCTS) is a family of asymmetric anytime aheuristic game tree search algorithms which have advanced the state-of-the-art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move which is independent of context. We also introduce a generalisation of MAST, called N-gram-Average-Sampling-Technique (NAST), which uses as context a fixed-lengthsequence (or N-tuple) of recent moves. We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy for the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord Of The Rings: The Confrontation, each of which has its own unique challenges for search-based AI.","PeriodicalId":158902,"journal":{"name":"2013 IEEE Conference on Computational Inteligence in Games (CIG)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"Bandits all the way down: UCB1 as a simulation policy in Monte Carlo Tree Search\",\"authors\":\"E. Powley, D. Whitehouse, P. Cowling\",\"doi\":\"10.1109/CIG.2013.6633613\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Monte Carlo Tree Search (MCTS) is a family of asymmetric anytime aheuristic game tree search algorithms which have advanced the state-of-the-art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move which is independent of context. We also introduce a generalisation of MAST, called N-gram-Average-Sampling-Technique (NAST), which uses as context a fixed-lengthsequence (or N-tuple) of recent moves. 
We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy for the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord Of The Rings: The Confrontation, each of which has its own unique challenges for search-based AI.\",\"PeriodicalId\":158902,\"journal\":{\"name\":\"2013 IEEE Conference on Computational Inteligence in Games (CIG)\",\"volume\":\"69 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE Conference on Computational Inteligence in Games (CIG)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIG.2013.6633613\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE Conference on Computational Inteligence in Games (CIG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIG.2013.6633613","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bandits all the way down: UCB1 as a simulation policy in Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is a family of asymmetric, anytime, aheuristic game tree search algorithms which have advanced the state of the art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move that is independent of context. We also introduce a generalisation of MAST, called the N-gram Average Sampling Technique (NAST), which uses as context a fixed-length sequence (or N-tuple) of recent moves. We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy of the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord of the Rings: The Confrontation, each of which has its own unique challenges for search-based AI.
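The abstract's central idea is to treat every step of the playout as its own multi-armed bandit. As a minimal sketch of that idea (not the authors' implementation; the abstract gives no pseudocode), the Python below applies UCB1 selection to N-gram contexts in the NAST style. The class name NastUcb1Policy, the select/update interface, and the exploration constant 0.7 are assumptions made here for illustration.

```python
import math
import random
from collections import defaultdict

class NastUcb1Policy:
    """A sketch of NAST with UCB1: each context of the last n-1 moves is
    treated as an independent multi-armed bandit whose arms are the moves
    played from that context in previous simulations."""

    def __init__(self, n=2, exploration=0.7):
        self.n = n                      # length of the N-gram (context plus move)
        self.exploration = exploration  # UCB1 exploration constant (assumed value)
        # stats[context][move] = [visit count, total reward]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))

    def _context(self, history):
        # The last n-1 moves form the bandit's context; n = 1 gives the
        # empty context, i.e. plain MAST with one global bandit per move.
        return tuple(history[-(self.n - 1):]) if self.n > 1 else ()

    def select(self, history, legal_moves):
        """Choose a simulation move by UCB1 over the current context."""
        arms = self.stats[self._context(history)]
        unvisited = [m for m in legal_moves if arms[m][0] == 0]
        if unvisited:                   # play each arm once before applying UCB1
            return random.choice(unvisited)
        total = sum(arms[m][0] for m in legal_moves)
        def ucb1(move):
            visits, reward = arms[move]
            return reward / visits + self.exploration * math.sqrt(math.log(total) / visits)
        return max(legal_moves, key=ucb1)

    def update(self, history, move, reward):
        """Back up the playout's terminal reward to the (context, move) bandit."""
        entry = self.stats[self._context(history)][move]
        entry[0] += 1
        entry[1] += reward
```

Setting n = 1 collapses every context to a single global bandit per move, recovering MAST; larger n makes the policy more context sensitive but spreads the simulation data across more bandits, which is the trade-off the paper examines empirically.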