Analysis of Hyper-Parameters for AlphaZero-Like Deep Reinforcement Learning

Haibo Wang, M. Emmerich, M. Preuss, A. Plaat
Int. J. Inf. Technol. Decis. Mak., volume 87 1, pages 829-853. Journal Article, published 2022-08-19. DOI: 10.1142/s0219622022500547 (https://doi.org/10.1142/s0219622022500547). Source: Semanticscholar.
Citations: 0

Abstract

The landmark achievements of AlphaGo Zero have created great research interest in self-play in reinforcement learning. In self-play, Monte Carlo Tree Search (MCTS) is used to train a deep neural network, which is then itself used in the tree searches. The training is governed by many hyper-parameters. There has been surprisingly little research on design choices for hyper-parameter values and loss functions, presumably because of the prohibitive computational cost of exploring the parameter space. In this paper, we investigate 12 hyper-parameters in an AlphaZero-like self-play algorithm and evaluate how these parameters contribute to training. We study them on small games, to achieve meaningful exploration with moderate computational effort. The experimental results show that training is highly sensitive to hyper-parameter choices. Through multi-objective analysis, we identify 4 important hyper-parameters to assess further. To start, we find the surprising result that too much training can sometimes lead to lower performance. Our main result is that the number of self-play iterations subsumes MCTS simulations, game episodes, and training epochs. The intuition is that these three increase together as self-play iterations increase, and that increasing them individually is sub-optimal. As a consequence of our experiments, we provide recommendations on setting hyper-parameter values in self-play. The outer loop of self-play iterations should be emphasized over the inner loop. This means that hyper-parameters for the inner loop should be set to lower values. A secondary result of our experiments concerns the choice of optimization goals, for which we also provide recommendations.
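
To make the outer-loop/inner-loop distinction concrete, the following is a minimal sketch of how the four quantities named in the abstract could nest in a self-play loop. It is not the authors' implementation: there is no real game, search, or network, only counters. The names `SelfPlayConfig`, `self_play_cost`, and `moves_per_game` are hypothetical, and the assumption that every move yields one training example is made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SelfPlayConfig:
    # Hypothetical field names; the abstract names the quantities, not their symbols.
    iterations: int        # outer loop: self-play iterations
    episodes: int          # inner loop: games played per iteration
    mcts_simulations: int  # inner loop: MCTS simulations per move
    epochs: int            # inner loop: training epochs per iteration


def self_play_cost(cfg: SelfPlayConfig, moves_per_game: int = 10) -> dict:
    """Walk the nested loops and count the work done (no real game or network)."""
    simulations = 0       # total MCTS simulations run
    examples = 0          # total training examples generated
    training_passes = 0   # total example-passes through the network
    network_updates = 0   # how often the retrained network is fed back into search

    for _ in range(cfg.iterations):                  # outer loop
        new_examples = 0
        for _ in range(cfg.episodes):                # stage 1: self-play games
            for _ in range(moves_per_game):
                simulations += cfg.mcts_simulations  # each move is chosen by MCTS
                new_examples += 1                    # assumed: one example per move
        examples += new_examples
        for _ in range(cfg.epochs):                  # stage 2: train on the new examples
            training_passes += new_examples
        network_updates += 1                         # next iteration searches with the new network

    return {
        "simulations": simulations,
        "examples": examples,
        "training_passes": training_passes,
        "network_updates": network_updates,
    }


if __name__ == "__main__":
    # Same product of the four settings, distributed differently between the loops.
    outer_heavy = SelfPlayConfig(iterations=16, episodes=5, mcts_simulations=10, epochs=2)
    inner_heavy = SelfPlayConfig(iterations=4, episodes=10, mcts_simulations=20, epochs=2)
    print("outer-heavy:", self_play_cost(outer_heavy))
    print("inner-heavy:", self_play_cost(inner_heavy))
```

In this toy accounting, raising any inner-loop setting multiplies the cost of every iteration, whereas only the outer loop increases how often the retrained network is fed back into the search. That is one way to read the abstract's recommendation to keep inner-loop values low, though the sketch itself says nothing about playing strength.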