{"title":"Descent在计算机奥林匹克竞赛中获得五枚金牌","authors":"Quentin Cohen-Solal, T. Cazenave","doi":"10.3233/icg-210192","DOIUrl":null,"url":null,"abstract":"Unlike AlphaZero-like algorithms (Silver et al., 2018), the Descent framework uses a variant of Unbounded Minimax (Korf and Chickering, 1996), instead of Monte Carlo Tree Search, to construct the partial game tree used to determine the best action to play and to collect data for learning. During training, at each move, the best sequences of moves are iteratively extended until terminal states. During evaluations, the safest action is chosen (after that the best sequences of moves are iteratively extended each until a leaf state is reached). Moreover, it also does not use a policy network, only a value network. The actions therefore do not need to be encoded. Unlike the AlphaZero paradigm, with Descent all data generated during the searches to determine the best actions to play is used for learning. As a result, much more data is generated per game, and thus the training is done more quickly and does not require a (massive) parallelization to give good results (contrary to AlphaZero). It can use end-of-game heuristic evaluation to improve its level of play faster, such as game score or game length (in order to win quickly and lose slowly).","PeriodicalId":14829,"journal":{"name":"J. Int. Comput. Games Assoc.","volume":"60 3","pages":"132-134"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Descent wins five gold medals at the Computer Olympiad\",\"authors\":\"Quentin Cohen-Solal, T. Cazenave\",\"doi\":\"10.3233/icg-210192\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Unlike AlphaZero-like algorithms (Silver et al., 2018), the Descent framework uses a variant of Unbounded Minimax (Korf and Chickering, 1996), instead of Monte Carlo Tree Search, to construct the partial game tree used to determine the best action to play and to collect data for learning. During training, at each move, the best sequences of moves are iteratively extended until terminal states. During evaluations, the safest action is chosen (after that the best sequences of moves are iteratively extended each until a leaf state is reached). Moreover, it also does not use a policy network, only a value network. The actions therefore do not need to be encoded. Unlike the AlphaZero paradigm, with Descent all data generated during the searches to determine the best actions to play is used for learning. As a result, much more data is generated per game, and thus the training is done more quickly and does not require a (massive) parallelization to give good results (contrary to AlphaZero). It can use end-of-game heuristic evaluation to improve its level of play faster, such as game score or game length (in order to win quickly and lose slowly).\",\"PeriodicalId\":14829,\"journal\":{\"name\":\"J. Int. Comput. Games Assoc.\",\"volume\":\"60 3\",\"pages\":\"132-134\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J. Int. Comput. 
Games Assoc.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3233/icg-210192\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Int. Comput. Games Assoc.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/icg-210192","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Unlike AlphaZero-like algorithms (Silver et al., 2018), the Descent framework uses a variant of Unbounded Minimax (Korf and Chickering, 1996), instead of Monte Carlo Tree Search, to construct the partial game tree used to determine the best action to play and to collect data for learning. During training, at each move, the best sequences of moves are iteratively extended until terminal states are reached. During evaluation, the safest action is chosen, after the best sequences of moves have each been iteratively extended until a leaf state is reached. Moreover, Descent does not use a policy network, only a value network, so actions do not need to be encoded. Unlike the AlphaZero paradigm, all data generated by the searches used to determine the best actions to play is used for learning. As a result, much more data is generated per game, so training proceeds more quickly and does not require (massive) parallelization to give good results, contrary to AlphaZero. Descent can also use end-of-game heuristic evaluations, such as the game score or the game length (in order to win quickly and lose slowly), to improve its level of play faster.
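The following is a minimal Python sketch of a Descent-style search iteration as described above, not the authors' implementation. It assumes a hypothetical game interface (actions, play, is_terminal, terminal_value, to_move) and a value_net callable, all introduced here for illustration, and shows how one iteration follows the current best moves down to a terminal state, backs values up with minimax, and leaves every visited state with a value usable as a training target.

```python
# Minimal, simplified sketch of a Descent-style search iteration (not the authors' code).
# Assumptions: a hypothetical `game` object exposing actions(s), play(s, a), is_terminal(s),
# terminal_value(s), and to_move(s) (1 for the max player, -1 for the min player); states
# are hashable; `value_net(s)` returns a heuristic value estimate for a non-terminal state.

def descent_iteration(game, state, value_net, tree):
    """Follow the current best moves from `state` down to a terminal state,
    expanding nodes along the way and backing values up with minimax.
    Every entry left in `tree` is a (state -> value) pair usable for learning."""
    if game.is_terminal(state):
        tree[state] = game.terminal_value(state)  # exact end-of-game value
        return tree[state]

    # Expansion: give every child a first value estimate (exact if terminal).
    children = [game.play(state, a) for a in game.actions(state)]
    for child in children:
        if child not in tree:
            tree[child] = (game.terminal_value(child) if game.is_terminal(child)
                           else value_net(child))

    # Selection: descend into the child currently believed best for the player to move.
    pick = max if game.to_move(state) == 1 else min
    best_child = pick(children, key=lambda c: tree[c])
    descent_iteration(game, best_child, value_net, tree)

    # Minimax backup: the state's value is the best of its children's values.
    tree[state] = pick(tree[c] for c in children)
    return tree[state]
```

Because every state visited during the search keeps a backed-up value in the tree, each game position contributes many training pairs, which is the point made above about Descent generating much more learning data per game than AlphaZero.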