Stacked calibration of off-policy policy evaluation for video game matchmaking

Eric Thibodeau-Laufer, Raul Chandias Ferrari, Li Yao, Olivier Delalleau, Yoshua Bengio

2013 IEEE Conference on Computational Intelligence in Games (CIG), published 2013-10-17. DOI: 10.1109/CIG.2013.6633642
We consider an industrial-strength application of recommendation systems to video-game matchmaking, in which off-policy policy evaluation is important but standard approaches can hardly be applied. The objective of the policy is to sequentially form teams of players from those waiting to be matched, in such a way as to produce well-balanced matches. Unfortunately, the available training data come from a policy that is neither perfectly known nor stochastic, making it impossible to use methods based on importance weights. Furthermore, we observe that when the estimated reward function and the policy are both obtained by training on the same off-policy dataset, policy evaluation using the estimated reward function is biased. We present a simple calibration procedure, similar to stacked regression, that removed most of the bias in the experiments we performed. Data collected during beta tests of Ghost Recon Online, a first-person shooter from Ubisoft, were used for the experiments.
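To make the failure of importance weighting and the stacking-style fix concrete, here is a minimal sketch. The standard importance-sampling estimator, roughly V_hat = (1/n) * sum_i [pi(a_i|s_i) / mu(a_i|s_i)] * r_i, requires the behaviour policy mu to be known and stochastic, which the abstract says does not hold here; a model-based alternative instead learns a reward model and then calibrates its predictions on held-out data, as in stacked regression. The code below is an illustrative assumption, not the paper's implementation: the model classes, the linear calibration map, and all names (stacked_calibration, evaluate_policy) are hypothetical.

```python
# Illustrative sketch: stacking-style calibration for model-based
# off-policy evaluation. Hypothetical; not the paper's actual code.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def stacked_calibration(X, y, n_splits=5):
    """Fit a reward model plus a calibration map learned on
    out-of-fold predictions, as in stacked regression.

    X: features of matches logged under the (unknown) behaviour policy
    y: observed outcomes, e.g. a match-balance score
    """
    oof_pred = np.empty(len(y), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, held_idx in kf.split(X):
        fold_model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
        # Out-of-fold predictions mimic how the reward model behaves
        # on matches it has never seen during training.
        oof_pred[held_idx] = fold_model.predict(X[held_idx])
    # The calibrator maps predicted reward to observed reward; fitting it
    # only on held-out predictions removes the optimism of in-sample fits.
    calibrator = LinearRegression().fit(oof_pred.reshape(-1, 1), y)
    reward_model = GradientBoostingRegressor().fit(X, y)  # refit on all data
    return reward_model, calibrator

def evaluate_policy(reward_model, calibrator, candidate_matches):
    """Estimate the value of a new policy from the matches it would form."""
    raw = reward_model.predict(candidate_matches)
    return float(calibrator.predict(raw.reshape(-1, 1)).mean())
```

Fitting the calibrator only on out-of-fold predictions is the point of the procedure: when the same dataset both trains the reward model and scores the candidate policy, the evaluation is optimistically biased, and the calibration map corrects most of that optimism.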