Offline Evaluation to Make Decisions About Playlist Recommendation Algorithms

Alois Gruson, Praveen Chandar, C. Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, Ben Carterette
{"title":"关于播放列表推荐算法的离线评估决策","authors":"Alois Gruson, Praveen Chandar, C. Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, Ben Carterette","doi":"10.1145/3289600.3291027","DOIUrl":null,"url":null,"abstract":"Evaluating algorithmic recommendations is an important, but difficult, problem. Evaluations conducted offline using data collected from user interactions with an online system often suffer from biases arising from the user interface or the recommendation engine. Online evaluation (A/B testing) can more easily address problems of bias, but depending on setting can be time-consuming and incur risk of negatively impacting the user experience, not to mention that it is generally more difficult when access to a large user base is not taken as granted. A compromise based on \\em counterfactual analysis is to present some subset of online users with recommendation results that have been randomized or otherwise manipulated, log their interactions, and then use those to de-bias offline evaluations on historical data. However, previous work does not offer clear conclusions on how well such methods correlate with and are able to predict the results of online A/B tests. Understanding this is crucial to widespread adoption of new offline evaluation techniques in recommender systems. In this work we present a comparison of offline and online evaluation results for a particular recommendation problem: recommending playlists of tracks to a user looking for music. We describe two different ways to think about de-biasing offline collections for more accurate evaluation. Our results show that, contrary to much of the previous work on this topic, properly-conducted offline experiments do correlate well to A/B test results, and moreover that we can expect an offline evaluation to identify the best candidate systems for online testing with high probability.","PeriodicalId":143253,"journal":{"name":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","volume":"53 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"83","resultStr":"{\"title\":\"Offline Evaluation to Make Decisions About PlaylistRecommendation Algorithms\",\"authors\":\"Alois Gruson, Praveen Chandar, C. Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, Ben Carterette\",\"doi\":\"10.1145/3289600.3291027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Evaluating algorithmic recommendations is an important, but difficult, problem. Evaluations conducted offline using data collected from user interactions with an online system often suffer from biases arising from the user interface or the recommendation engine. Online evaluation (A/B testing) can more easily address problems of bias, but depending on setting can be time-consuming and incur risk of negatively impacting the user experience, not to mention that it is generally more difficult when access to a large user base is not taken as granted. A compromise based on \\\\em counterfactual analysis is to present some subset of online users with recommendation results that have been randomized or otherwise manipulated, log their interactions, and then use those to de-bias offline evaluations on historical data. However, previous work does not offer clear conclusions on how well such methods correlate with and are able to predict the results of online A/B tests. 
Understanding this is crucial to widespread adoption of new offline evaluation techniques in recommender systems. In this work we present a comparison of offline and online evaluation results for a particular recommendation problem: recommending playlists of tracks to a user looking for music. We describe two different ways to think about de-biasing offline collections for more accurate evaluation. Our results show that, contrary to much of the previous work on this topic, properly-conducted offline experiments do correlate well to A/B test results, and moreover that we can expect an offline evaluation to identify the best candidate systems for online testing with high probability.\",\"PeriodicalId\":143253,\"journal\":{\"name\":\"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining\",\"volume\":\"53 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-01-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"83\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3289600.3291027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3289600.3291027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 83

Abstract

Evaluating algorithmic recommendations is an important but difficult problem. Evaluations conducted offline, using data collected from user interactions with an online system, often suffer from biases arising from the user interface or the recommendation engine. Online evaluation (A/B testing) can more easily address problems of bias, but depending on the setting it can be time-consuming and risks negatively impacting the user experience; it is also generally more difficult when access to a large user base cannot be taken for granted. A compromise based on counterfactual analysis is to present some subset of online users with recommendation results that have been randomized or otherwise manipulated, log their interactions, and then use those logs to de-bias offline evaluations on historical data. However, previous work does not offer clear conclusions on how well such methods correlate with, and are able to predict, the results of online A/B tests. Understanding this is crucial to widespread adoption of new offline evaluation techniques in recommender systems. In this work we present a comparison of offline and online evaluation results for a particular recommendation problem: recommending playlists of tracks to a user looking for music. We describe two different ways to think about de-biasing offline collections for more accurate evaluation. Our results show that, contrary to much of the previous work on this topic, properly conducted offline experiments do correlate well with A/B test results, and moreover that we can expect an offline evaluation to identify the best candidate systems for online testing with high probability.
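The abstract only sketches the counterfactual idea, and the paper's two de-biasing approaches are not specified here. As a hedged illustration of the standard machinery this line of work builds on, the sketch below implements inverse propensity scoring (IPS), the textbook counterfactual estimator: each logged reward is reweighted by the ratio of the candidate policy's probability of showing that recommendation to the logging policy's probability. All function names and data below are hypothetical, and the paper itself may use a different or more refined estimator.

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, target_propensities, clip=10.0):
    """Inverse propensity scoring (IPS) estimate of a candidate policy's
    expected reward, computed from interactions logged under a (partially
    randomized) logging policy.

    rewards               -- observed outcomes for the shown recommendations,
                             e.g. 1 if the user streamed the playlist, else 0
    logging_propensities  -- probability the logging policy assigned to each
                             shown recommendation (must be > 0)
    target_propensities   -- probability the candidate policy would assign to
                             the same recommendation in the same context
    clip                  -- cap on the importance weights; clipping is a
                             common variance-reduction heuristic, at the cost
                             of some bias
    """
    weights = np.minimum(target_propensities / logging_propensities, clip)
    return float(np.mean(weights * rewards))

# Hypothetical logged data for five impressions.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
logging_p = np.array([0.20, 0.10, 0.25, 0.05, 0.50])  # randomized logging policy
target_p = np.array([0.40, 0.02, 0.10, 0.15, 0.60])   # candidate system
print(ips_estimate(rewards, logging_p, target_p))     # de-biased reward estimate
```

The reweighting makes the logged data look as if it had been collected under the candidate policy, which is what lets a single historical log rank many candidate systems offline. The estimate's variance grows as the two policies diverge, which is one reason clipped or otherwise regularized variants are common in practice.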