Extreme Value Monte Carlo Tree Search (Extended Abstract)

Masataro Asai, Stephen Wissow
{"title":"Extreme Value Monte Carlo Tree Search (Extended Abstract)","authors":"Masataro Asai, Stephen Wissow","doi":"10.1609/socs.v17i1.31569","DOIUrl":null,"url":null,"abstract":"Monte-Carlo Tree Search (MCTS) combined with Multi-Armed Bandit (MAB) has had limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2023) showed that UCB1, designed for bounded rewards, does not perform well when applied to the cost-to-go estimates of classical planning, which are unbounded in R, then improved the performance by using a Gaussian reward MAB instead. We further sharpen our understanding of ideal bandits for planning tasks by resolving three issues: First, Gaussian MABs under-specify the support of cost-to-go estimates as [−∞, ∞]. Second, Full-Bellman backup that backpropagates max/min of samples lacks theoretical justifications. Third, removing dead-ends lacks justifications in Monte-Carlo backup. We use Extreme Value Theory Type 2 to resolve them at once, propose two bandits (UCB1-Uniform/Power), and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.","PeriodicalId":425645,"journal":{"name":"Symposium on Combinatorial Search","volume":"56 7","pages":"257-258"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Symposium on Combinatorial Search","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1609/socs.v17i1.31569","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Monte-Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) had limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2023) showed that UCB1, designed for bounded rewards, does not perform well when applied to the cost-to-go estimates of classical planning, which are unbounded in ℝ, and improved performance by using a Gaussian-reward MAB instead. We further sharpen our understanding of ideal bandits for planning tasks by resolving three issues. First, Gaussian MABs under-specify the support of cost-to-go estimates as (−∞, ∞). Second, Full-Bellman backup, which backpropagates the max/min of samples, lacks theoretical justification. Third, removing dead-ends during Monte-Carlo backup lacks justification. We use Extreme Value Theory Type 2 to resolve all three issues at once, propose two bandits (UCB1-Uniform and UCB1-Power), and apply them to MCTS for classical planning. We formally prove their regret bounds and empirically demonstrate their performance in classical planning.
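To make the baseline concrete, below is a minimal Python sketch (not taken from the paper) of UCB1-style child selection in MCTS, phrased for minimizing cost-to-go rather than maximizing a bounded reward. The Node class and its field names are hypothetical illustration; the exploration term assumes the standard bounded-reward setting that, per the abstract, breaks down for unbounded cost-to-go estimates.

    import math

    class Node:
        """Hypothetical MCTS node: visit count plus running cost statistics."""
        def __init__(self, state):
            self.state = state
            self.children = []   # child Nodes
            self.visits = 0      # times this node was selected
            self.cost_sum = 0.0  # sum of sampled cost-to-go estimates

        def mean_cost(self):
            return self.cost_sum / self.visits

    def ucb1_select(parent, c=math.sqrt(2)):
        """Pick the child with the lowest lower-confidence bound on cost."""
        # Unvisited children first, so every arm gets at least one sample.
        for child in parent.children:
            if child.visits == 0:
                return child
        # UCB1 mirrored for minimization: low mean cost is good, and the
        # exploration bonus c * sqrt(ln N / n) is subtracted. Its scale
        # presumes rewards bounded in [0, 1]; with unbounded cost-to-go
        # estimates there is no principled choice of c, which is the
        # failure mode that motivates Gaussian and EVT-based bandits.
        log_n = math.log(parent.visits)
        return min(
            parent.children,
            key=lambda ch: ch.mean_cost() - c * math.sqrt(log_n / ch.visits),
        )

Per the abstract, UCB1-Uniform and UCB1-Power instead derive the confidence bound from Extreme Value Theory Type 2 (the Fréchet family), matching the actual support of cost-to-go estimates; the exact bounds and their regret analysis are given in the full paper and are not reproduced in this sketch.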