Regulation of exploration for simple regret minimization in Monte-Carlo tree search

Yun-Ching Liu, Yoshimasa Tsuruoka
{"title":"蒙特卡洛树搜索中简单遗憾最小化的探索规则","authors":"Yun-Ching Liu, Yoshimasa Tsuruoka","doi":"10.1109/CIG.2015.7317923","DOIUrl":null,"url":null,"abstract":"The application of multi-armed bandit (MAB) algorithms was a critical step in the development of Monte-Carlo tree search (MCTS). One example would be the UCT algorithm, which applies the UCB bandit algorithm. Various research has been conducted on applying other bandit algorithms to MCTS. Simple regret bandit algorithms, which aim to identify the optimal arm after a number of trials, have been of great interest in various fields in recent years. However, the simple regret bandit algorithm has the tendency to spend more time on sampling suboptimal arms, which may be a problem in the context of game tree search. In this research, we will propose combined confidence bounds, which utilize the characteristics of the confidence bounds of the improved UCB and UCB √· algorithms to regulate exploration for simple regret minimization in MCTS. We will demonstrate the combined confidence bounds bandit algorithm has better empirical performance than that of the UCB algorithm on the MAB problem. We will show that the combined confidence bounds MCTS (CCB-MCTS) has better performance over plain UCT on the game of 9 × 9 Go, and has shown good scalability. We will also show that the performance of CCB-MCTS can be further enhanced with the application of all-moves-as-first (AMAF) heuristic.","PeriodicalId":244862,"journal":{"name":"2015 IEEE Conference on Computational Intelligence and Games (CIG)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Regulation of exploration for simple regret minimization in Monte-Carlo tree search\",\"authors\":\"Yun-Ching Liu, Yoshimasa Tsuruoka\",\"doi\":\"10.1109/CIG.2015.7317923\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The application of multi-armed bandit (MAB) algorithms was a critical step in the development of Monte-Carlo tree search (MCTS). One example would be the UCT algorithm, which applies the UCB bandit algorithm. Various research has been conducted on applying other bandit algorithms to MCTS. Simple regret bandit algorithms, which aim to identify the optimal arm after a number of trials, have been of great interest in various fields in recent years. However, the simple regret bandit algorithm has the tendency to spend more time on sampling suboptimal arms, which may be a problem in the context of game tree search. In this research, we will propose combined confidence bounds, which utilize the characteristics of the confidence bounds of the improved UCB and UCB √· algorithms to regulate exploration for simple regret minimization in MCTS. We will demonstrate the combined confidence bounds bandit algorithm has better empirical performance than that of the UCB algorithm on the MAB problem. We will show that the combined confidence bounds MCTS (CCB-MCTS) has better performance over plain UCT on the game of 9 × 9 Go, and has shown good scalability. 
We will also show that the performance of CCB-MCTS can be further enhanced with the application of all-moves-as-first (AMAF) heuristic.\",\"PeriodicalId\":244862,\"journal\":{\"name\":\"2015 IEEE Conference on Computational Intelligence and Games (CIG)\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE Conference on Computational Intelligence and Games (CIG)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIG.2015.7317923\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE Conference on Computational Intelligence and Games (CIG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIG.2015.7317923","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

The application of multi-armed bandit (MAB) algorithms was a critical step in the development of Monte-Carlo tree search (MCTS). One example is the UCT algorithm, which applies the UCB bandit algorithm. Various studies have investigated applying other bandit algorithms to MCTS. Simple regret bandit algorithms, which aim to identify the optimal arm after a number of trials, have attracted great interest in various fields in recent years. However, simple regret bandit algorithms tend to spend more time sampling suboptimal arms, which may be a problem in the context of game tree search. In this research, we propose combined confidence bounds, which exploit the characteristics of the confidence bounds of the improved UCB and UCB√· algorithms to regulate exploration for simple regret minimization in MCTS. We demonstrate that the combined confidence bounds bandit algorithm empirically outperforms the UCB algorithm on the MAB problem. We show that combined confidence bounds MCTS (CCB-MCTS) outperforms plain UCT on the game of 9 × 9 Go and exhibits good scalability. We also show that the performance of CCB-MCTS can be further enhanced with the application of the all-moves-as-first (AMAF) heuristic.
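To make the cumulative-regret versus simple-regret distinction concrete, below is a minimal Python sketch contrasting a UCB1 selection rule with a UCB√·-style rule on a Bernoulli MAB problem, measuring simple regret (the value gap of the arm recommended at the end of the budget). It assumes the UCB√·-style rule simply replaces UCB1's logarithmic exploration term with a square-root term; the constants, the improved UCB component, and the paper's exact combined confidence bound are not reproduced here.

import math
import random

def ucb1_index(mean, n_i, n, c=math.sqrt(2)):
    # UCB1: logarithmic exploration term, tuned for cumulative regret.
    return mean + c * math.sqrt(math.log(n) / n_i)

def ucb_sqrt_index(mean, n_i, n, c=math.sqrt(2)):
    # UCB-sqrt-style bound (an assumption for illustration): sqrt(n)
    # replaces log(n), exploring more aggressively, which suits
    # simple-regret (best-arm identification) objectives.
    return mean + c * math.sqrt(math.sqrt(n) / n_i)

def run_bandit(arm_means, budget, index_fn, seed=0):
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        if t < k:
            arm = t  # pull each arm once to initialise the estimates
        else:
            arm = max(range(k),
                      key=lambda i: index_fn(sums[i] / counts[i], counts[i], t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
    # Recommend the arm with the best empirical mean.
    best = max(range(k), key=lambda i: sums[i] / counts[i])
    # Simple regret: gap between the true best arm and the recommendation.
    return max(arm_means) - arm_means[best]

arms = [0.5, 0.45, 0.55, 0.4]
for name, fn in [("UCB1", ucb1_index), ("UCB-sqrt", ucb_sqrt_index)]:
    regret = sum(run_bandit(arms, 2000, fn, seed=s) for s in range(100)) / 100
    print(f"{name}: mean simple regret = {regret:.4f}")

The faster-growing √n term forces wider exploration, which typically costs cumulative reward but improves the chance of identifying the best arm within a fixed budget; regulating exactly this trade-off inside the search tree is what the combined confidence bounds of CCB-MCTS are designed for.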