Targeted Search Control in AlphaZero for Effective Policy Improvement

Alexandre Trudeau, Michael H. Bowling
{"title":"Targeted Search Control in AlphaZero for Effective Policy Improvement","authors":"Alexandre Trudeau, Michael H. Bowling","doi":"10.48550/arXiv.2302.12359","DOIUrl":null,"url":null,"abstract":"AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Adaptive Agents and Multi-Agent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2302.12359","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.
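The core mechanism described above, sampling the start state of each self-play trajectory from an archive of states of interest rather than always beginning at the game's initial state, can be illustrated with a short sketch. The abstract does not specify how the archive is populated, bounded, or sampled, so the FIFO eviction, uniform sampling, and the `p_initial` mixing probability below are illustrative assumptions rather than the paper's exact procedure; `play_game` is a hypothetical callable standing in for one self-play game.

```python
import random

class GoExploitSelfPlay:
    """Minimal sketch of Go-Exploit-style search control.

    Assumptions (not specified in the abstract): a bounded FIFO archive,
    uniform sampling from it, and a fixed probability of still starting
    from the game's true initial state.
    """

    def __init__(self, initial_state, archive_capacity=10000, p_initial=0.2):
        self.initial_state = initial_state
        self.archive = [initial_state]          # archive of candidate start states
        self.archive_capacity = archive_capacity
        self.p_initial = p_initial              # hypothetical chance of using the true initial state

    def sample_start_state(self):
        # Standard AlphaZero always starts self-play from the initial state;
        # Go-Exploit instead draws varied start states from the archive.
        if random.random() < self.p_initial:
            return self.initial_state
        return random.choice(self.archive)

    def add_state_of_interest(self, state):
        # States encountered during self-play are added back to the archive,
        # bounded by a fixed capacity (FIFO eviction here is an assumption).
        self.archive.append(state)
        if len(self.archive) > self.archive_capacity:
            self.archive.pop(0)

    def generate_trajectory(self, play_game):
        # `play_game(start)` runs one self-play game from `start` and
        # returns the list of states it visited.
        start = self.sample_start_state()
        visited = play_game(start)
        for state in visited:
            self.add_state_of_interest(state)
        return visited
```

Because many trajectories begin mid-game, they are shorter than full games from the initial position, so each one contributes value targets that are less correlated with one another; this is the source of the improved value training described in the abstract.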