Stacked calibration of off-policy policy evaluation for video game matchmaking

Eric Thibodeau-Laufer, Raul Chandias Ferrari, Li Yao, Olivier Delalleau, Yoshua Bengio
Published in: 2013 IEEE Conference on Computational Intelligence in Games (CIG)
Publication date: 2013-10-17
DOI: 10.1109/CIG.2013.6633642
URL: https://doi.org/10.1109/CIG.2013.6633642
Citations: 0

Abstract

We consider an industrial-strength application of recommendation systems to video-game matchmaking in which off-policy policy evaluation is important but where standard approaches can hardly be applied. The objective of the policy is to sequentially form teams of players from those waiting to be matched, in such a way as to produce well-balanced matches. Unfortunately, the available training data comes from a policy that is not known perfectly and that is not stochastic, making it impossible to use methods based on importance weights. Furthermore, we observe that when the estimated reward function and the policy are obtained by training from the same off-policy dataset, the policy evaluation using the estimated reward function is biased. We present a simple calibration procedure that is similar to stacked regression and that removes most of the bias in the experiments we performed. Data collected during beta tests of Ghost Recon Online, a first-person shooter from Ubisoft, were used for the experiments.
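The abstract does not spell out the calibration procedure, so the following is only a minimal sketch of how a stacked-regression-style calibration might look. It assumes a linear reward model, a two-way split of the logged data, and an affine second-level calibrator; the function names (`fit_linear`, `stacked_calibration`, `evaluate_policy`) and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_linear(features, targets):
    """Least-squares fit with an intercept; returns the weight vector."""
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def predict_linear(weights, features):
    """Apply weights produced by fit_linear to new features."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ weights


def stacked_calibration(train_x, train_r, held_x, held_r):
    """Fit a reward model on one split and a calibrator on the other."""
    reward_w = fit_linear(train_x, train_r)
    held_pred = predict_linear(reward_w, held_x)
    # Second-level regression (assumed affine): observed reward as a
    # function of the first-level prediction on held-out data.
    calib_w = fit_linear(held_pred.reshape(-1, 1), held_r)
    return reward_w, calib_w


def evaluate_policy(reward_w, calib_w, chosen_x):
    """Return (raw, calibrated) mean predicted reward for chosen matches."""
    raw = predict_linear(reward_w, chosen_x)
    calibrated = predict_linear(calib_w, raw.reshape(-1, 1))
    return float(raw.mean()), float(calibrated.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for logged match features and balance rewards.
    x = rng.normal(size=(400, 3))
    r = 0.5 + x @ np.array([0.3, -0.2, 0.1]) + rng.normal(scale=0.1, size=400)
    reward_w, calib_w = stacked_calibration(x[:200], r[:200], x[200:], r[200:])
    # Hypothetical policy: pretend it would form the matches in the first 50 rows.
    print(evaluate_policy(reward_w, calib_w, x[:50]))
```

Fitting the calibrator on data that the reward model did not see is the design choice borrowed from stacked regression. This toy example does not reproduce the bias described in the abstract, which arises when the policy itself is trained against the same estimated reward function.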