{"title":"强盗一路下来:UCB1作为蒙特卡洛树搜索的模拟策略","authors":"E. Powley, D. Whitehouse, P. Cowling","doi":"10.1109/CIG.2013.6633613","DOIUrl":null,"url":null,"abstract":"Monte Carlo Tree Search (MCTS) is a family of asymmetric anytime aheuristic game tree search algorithms which have advanced the state-of-the-art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move which is independent of context. We also introduce a generalisation of MAST, called N-gram-Average-Sampling-Technique (NAST), which uses as context a fixed-lengthsequence (or N-tuple) of recent moves. We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy for the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord Of The Rings: The Confrontation, each of which has its own unique challenges for search-based AI.","PeriodicalId":158902,"journal":{"name":"2013 IEEE Conference on Computational Inteligence in Games (CIG)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"Bandits all the way down: UCB1 as a simulation policy in Monte Carlo Tree Search\",\"authors\":\"E. Powley, D. Whitehouse, P. Cowling\",\"doi\":\"10.1109/CIG.2013.6633613\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Monte Carlo Tree Search (MCTS) is a family of asymmetric anytime aheuristic game tree search algorithms which have advanced the state-of-the-art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move which is independent of context. We also introduce a generalisation of MAST, called N-gram-Average-Sampling-Technique (NAST), which uses as context a fixed-lengthsequence (or N-tuple) of recent moves. 
We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy for the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord Of The Rings: The Confrontation, each of which has its own unique challenges for search-based AI.\",\"PeriodicalId\":158902,\"journal\":{\"name\":\"2013 IEEE Conference on Computational Inteligence in Games (CIG)\",\"volume\":\"69 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE Conference on Computational Inteligence in Games (CIG)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIG.2013.6633613\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE Conference on Computational Inteligence in Games (CIG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIG.2013.6633613","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bandits all the way down: UCB1 as a simulation policy in Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is a family of asymmetric, anytime, aheuristic game tree search algorithms which have advanced the state of the art in several challenging domains. MCTS learns a playout policy, iteratively building a partial tree to store and further refine the learned portion of the policy. When the playout leaves the existing tree, it falls back to a default simulation policy, which for many variants of MCTS chooses actions uniformly at random. This paper investigates how a simulation policy can be learned during the search, helping the playout policy remain plausible from root to terminal state without the injection of prior knowledge. Since the simulation policy visits states that are previously unseen, its decisions cannot be as context sensitive as those in the tree policy. We consider the well-known Move-Average Sampling Technique (MAST), which learns a value for each move that is independent of context. We also introduce a generalisation of MAST, called the N-gram Average Sampling Technique (NAST), which uses as context a fixed-length sequence (or N-tuple) of recent moves. We compare several policies for selecting moves during simulation, including the UCB1 policy for multi-armed bandits (as used in the tree policy of the popular UCT variant of MCTS). In addition to the elegance of treating the entire playout as a series of multi-armed bandit problems, we find that UCB1 gives consistently strong performance. We present empirical results for three games of imperfect information, namely the card games Dou Di Zhu and Hearts and the board game Lord of the Rings: The Confrontation, each of which has its own unique challenges for search-based AI.
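The abstract's central idea is to treat every step of the playout as its own multi-armed bandit. As a minimal sketch of that idea (not the authors' implementation; the abstract gives no pseudocode), the Python below applies UCB1 selection to N-gram contexts in the NAST style. The class name NastUcb1Policy, the select/update interface, and the exploration constant 0.7 are assumptions made here for illustration.

```python
import math
import random
from collections import defaultdict

class NastUcb1Policy:
    """A sketch of NAST with UCB1: each context of the last n-1 moves is
    treated as an independent multi-armed bandit whose arms are the moves
    played from that context in previous simulations."""

    def __init__(self, n=2, exploration=0.7):
        self.n = n                      # length of the N-gram (context plus move)
        self.exploration = exploration  # UCB1 exploration constant (assumed value)
        # stats[context][move] = [visit count, total reward]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))

    def _context(self, history):
        # The last n-1 moves form the bandit's context; n = 1 gives the
        # empty context, i.e. plain MAST with one global bandit per move.
        return tuple(history[-(self.n - 1):]) if self.n > 1 else ()

    def select(self, history, legal_moves):
        """Choose a simulation move by UCB1 over the current context."""
        arms = self.stats[self._context(history)]
        unvisited = [m for m in legal_moves if arms[m][0] == 0]
        if unvisited:                   # play each arm once before applying UCB1
            return random.choice(unvisited)
        total = sum(arms[m][0] for m in legal_moves)
        def ucb1(move):
            visits, reward = arms[move]
            return reward / visits + self.exploration * math.sqrt(math.log(total) / visits)
        return max(legal_moves, key=ucb1)

    def update(self, history, move, reward):
        """Back up the playout's terminal reward to the (context, move) bandit."""
        entry = self.stats[self._context(history)][move]
        entry[0] += 1
        entry[1] += reward
```

Setting n = 1 collapses every context to a single global bandit per move, recovering MAST; larger n makes the policy more context sensitive but spreads the simulation data across more bandits, which is the trade-off the paper examines empirically.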