{"title":"Offline Evaluation and Optimization for Interactive Systems","authors":"Lihong Li","doi":"10.1145/2684822.2697040","DOIUrl":null,"url":null,"abstract":"Evaluating and optimizing an interactive system (like search engines, recommender and advertising systems) from historical data against a predefined online metric is challenging, especially when that metric is computed from user feedback such as clicks and payments. The key challenge is counterfactual in nature: we only observe a user's feedback for actions taken by the system, but we do not know what that user would have reacted to a different action. The golden standard to evaluate such metrics of a user-interacting system is online A/B experiments (a.k.a. randomized controlled experiments), which can be expensive in terms of both time and engineering resources. Offline evaluation/optimization (sometimes referred to as off-policy learning in the literature) thus becomes critical, aiming to evaluate the same metrics without running (many) expensive A/B experiments on live users. One approach to offline evaluation is to build a user model that simulates user behavior (clicks, purchases, etc.) under various contexts, and then evaluate metrics of a system with this simulator. While being straightforward and common in practice, the reliability of such model-based approaches relies heavily on how well the user model is built. Furthermore, it is often difficult to know a priori whether a user model is good enough to be trustable. Recent years have seen a growing interest in another solution to the offline evaluation problem. Using statistical techniques like importance sampling and doubly robust estimation, the approach can give unbiased estimates of metrics for a wide range of problems. It enjoys other benefits as well. For example, it often allows data scientists to obtain a confidence interval for the estimate to quantify the amount of uncertainty; it does not require building user models, so is more robust and easier to apply. All these benefits make the approach particularly attractive to a wide range of problems. Successful applications have been reported in the last few years by some of the industrial leaders. This tutorial gives a review of the basic theory and representative techniques. Applications of these techniques are illustrated through several case studies done at Microsoft and Yahoo!.","PeriodicalId":179443,"journal":{"name":"Proceedings of the Eighth ACM International Conference on Web Search and Data Mining","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Eighth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2684822.2697040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 10
Abstract
Evaluating and optimizing an interactive system (such as a search engine, recommender system, or advertising system) from historical data against a predefined online metric is challenging, especially when that metric is computed from user feedback such as clicks and payments. The key challenge is counterfactual in nature: we only observe a user's feedback for the actions actually taken by the system; we do not know how that user would have reacted to a different action. The gold standard for evaluating such metrics of a user-interacting system is the online A/B experiment (a.k.a. randomized controlled experiment), which can be expensive in both time and engineering resources. Offline evaluation/optimization (sometimes referred to as off-policy learning in the literature) thus becomes critical: it aims to evaluate the same metrics without running (many) expensive A/B experiments on live users. One approach to offline evaluation is to build a user model that simulates user behavior (clicks, purchases, etc.) under various contexts, and then evaluate the system's metrics with this simulator. While straightforward and common in practice, such model-based approaches are only as reliable as the user model they are built on, and it is often difficult to know a priori whether a user model is good enough to be trusted. Recent years have seen growing interest in another solution to the offline evaluation problem. Using statistical techniques such as importance sampling and doubly robust estimation, this approach can give unbiased estimates of metrics for a wide range of problems. It enjoys other benefits as well: it often allows data scientists to obtain a confidence interval that quantifies the uncertainty of the estimate, and it does not require building user models, so it is more robust and easier to apply. All these benefits make the approach particularly attractive to a wide range of problems, and successful applications have been reported in the last few years by several industry leaders. This tutorial reviews the basic theory and representative techniques; their applications are illustrated through several case studies done at Microsoft and Yahoo!.
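To make the two estimators named in the abstract concrete, here is a minimal Python sketch (not code from the tutorial) that computes inverse-propensity-scoring (IPS) and doubly robust (DR) estimates of a target policy's value from synthetic logged bandit data. The logging policy, target policy, and reward model below are all invented for illustration.

```python
# Illustrative sketch of off-policy evaluation on synthetic logged data.
# Hypothetical setup: a logging policy mu chooses among n_actions actions
# uniformly at random; each log record stores (action, reward, mu's prob).
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 10_000, 5

logging_prob = np.full(n_actions, 1.0 / n_actions)
actions = rng.integers(n_actions, size=n)
true_mean_reward = np.linspace(0.1, 0.5, n_actions)  # unknown in practice
rewards = rng.binomial(1, true_mean_reward[actions])

# Target policy pi to evaluate offline (here: a fixed action distribution).
target_prob = np.array([0.05, 0.05, 0.1, 0.2, 0.6])

# IPS: reweight each logged reward by pi(a)/mu(a). Unbiased as long as
# mu(a) > 0 wherever pi(a) > 0.
weights = target_prob[actions] / logging_prob[actions]
ips_terms = weights * rewards
ips_estimate = ips_terms.mean()

# Normal-approximation 95% confidence interval from the per-record terms,
# quantifying the uncertainty mentioned in the abstract.
stderr = ips_terms.std(ddof=1) / np.sqrt(n)
ci = (ips_estimate - 1.96 * stderr, ips_estimate + 1.96 * stderr)

# DR: use a (possibly imperfect) reward model as a baseline and correct it
# with an importance-weighted residual. The estimate stays unbiased even if
# the model is wrong, with lower variance when the model is decent.
reward_model = np.full(n_actions, rewards.mean())  # crude stand-in model
dr_terms = target_prob @ reward_model + weights * (rewards - reward_model[actions])
dr_estimate = dr_terms.mean()

print(f"IPS: {ips_estimate:.4f}  95% CI [{ci[0]:.4f}, {ci[1]:.4f}]")
print(f"DR:  {dr_estimate:.4f}  (true value {target_prob @ true_mean_reward:.4f})")
```

In a real system, the logged propensities would come from the randomization actually used at serving time, and the reward model from a supervised regression on historical data; the sketch only shows the shape of the computation.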