Athénan wins sixteen gold medals at the Computer Olympiad

IF 0.2 · Zone 4 (Computer Science) · Q4 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

ICGA Journal · Pub Date: 2024-01-02 · DOI: 10.3233/icg-230239

Quentin Cohen-Solal, Tristan Cazenave

Citations: 0

Abstract

Unlike AlphaZero-like algorithms (Silver et al., 2018), Athénan is based on the Descent framework (Cohen-Solal, 2020). During training, it therefore uses Descent, a variant of Unbounded Minimax (Korf and Chickering, 1996), instead of Monte Carlo Tree Search to construct the partial game tree used to determine the best action to play and to collect data for learning. With Descent, at each move, the best sequences of moves are iteratively extended until they reach terminal states. During evaluations, another variant of Unbounded Minimax is used; in particular, it contains a generic solver and selects the safest action when deciding between actions. Moreover, in contrast to AlphaZero, Athénan uses only a value network and no policy network, so the actions do not need to be encoded. In addition, unlike in the AlphaZero paradigm, all data generated during the searches that determine the best actions to play is used for learning. As a result, much more data is generated per match (Cohen-Solal and Cazenave, 2023), so training is faster and does not require (massive) parallelization to give good results, in contrast to AlphaZero. Athénan can use end-of-game heuristic evaluations, such as game score or game length (in order to win quickly and lose slowly), to improve its level of play. Further improvements are described in Cohen-Solal (2020).
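The search idea in the abstract can be illustrated with a minimal sketch. It is not Athénan's implementation: the toy game (a pile of stones, take one to three, whoever takes the last stone wins), the constant `learned_value` standing in for the value network, the exact game-length formula in `terminal_value`, and the fixed iteration budget are all assumptions made for illustration. The generic solver, the safest-action rule, and the learning step are omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical toy game: (stones left, player to move, moves played).
State = Tuple[int, int, int]
START_STONES = 5
MAX_TAKE = 3


def actions(s: State) -> List[int]:
    return list(range(1, min(MAX_TAKE, s[0]) + 1))


def play(s: State, take: int) -> State:
    return (s[0] - take, 1 - s[1], s[2] + 1)


def is_terminal(s: State) -> bool:
    return s[0] == 0


def terminal_value(s: State) -> float:
    """Value for player 0, scaled by game length (assumed formula):
    shorter wins score higher, longer losses score closer to zero."""
    winner = 1 - s[1]  # the player who just took the last stone
    outcome = 1.0 if winner == 0 else -1.0
    return outcome * (START_STONES - s[2] + 1)


def learned_value(s: State) -> float:
    """Stand-in for a value network; a real system would learn this."""
    return 0.0


def descent_iteration(s: State,
                      values: Dict[State, float],
                      child_values: Dict[State, Dict[int, float]]) -> float:
    """Follow the currently best line down to a terminal state,
    then back the resulting values up along that line."""
    if is_terminal(s):
        values[s] = terminal_value(s)
        return values[s]
    if s not in child_values:  # first visit: evaluate every child
        child_values[s] = {}
        for a in actions(s):
            nxt = play(s, a)
            child_values[s][a] = (terminal_value(nxt) if is_terminal(nxt)
                                  else learned_value(nxt))
    pick = max if s[1] == 0 else min  # player 0 maximizes, player 1 minimizes
    best = pick(child_values[s], key=child_values[s].get)
    child_values[s][best] = descent_iteration(play(s, best), values, child_values)
    best = pick(child_values[s], key=child_values[s].get)
    values[s] = child_values[s][best]
    return values[s]


def search(root: State, iterations: int = 20) -> Tuple[int, Dict[State, float]]:
    """Run a fixed number of iterations from a non-terminal root.
    Returns the chosen action and every (state, value) pair seen,
    which is the data a learner could train on."""
    values: Dict[State, float] = {}
    child_values: Dict[State, Dict[int, float]] = {}
    for _ in range(iterations):
        descent_iteration(root, values, child_values)
    pick = max if root[1] == 0 else min
    best = pick(child_values[root], key=child_values[root].get)
    return best, values


if __name__ == "__main__":
    best_action, training_data = search((START_STONES, 0, 0))
    print(best_action, len(training_data))
```

In this sketch, `search((5, 0, 0))` picks action 1, leaving four stones, which is the only winning move in the toy game. Every state touched during the search receives a value in the returned dictionary, mirroring the abstract's point that all search data, not only the root position, can feed learning.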

Source journal

ICGA Journal · Engineering & Technology, Computer Science: Software Engineering

Self-citation rate

25.00%

Journal description: The ICGA Journal provides an international forum for computer games researchers presenting new results on ongoing work. The editors invite contributors to submit papers on all aspects of research related to computers and games. Relevant topics include, but are not limited to: (1) the current state of game-playing programs for classic and modern board and card games, (2) the current state of virtual, casual and video games, (3) new theoretical developments in game-related research, and (4) general scientific contributions produced by the study of games. Also welcome is research on topics such as: (5) social aspects of computer games, (6) cognitive research on how humans play games, (7) capture and analysis of game data, and (8) issues related to networked games.