Retrieval Evaluation Measures that Agree with Users’ SERP Preferences

T. Sakai, Zhaohao Zeng
{"title":"Retrieval Evaluation Measures that Agree with Users’ SERP Preferences","authors":"T. Sakai, Zhaohao Zeng","doi":"10.1145/3431813","DOIUrl":null,"url":null,"abstract":"We examine the “goodness” of ranked retrieval evaluation measures in terms of how well they align with users’ Search Engine Result Page (SERP) preferences for web search. The SERP preferences cover 1,127 topic-SERP-SERP triplets extracted from the NTCIR-9 INTENT task, reflecting the views of 15 different assessors. Each assessor made two SERP preference judgements for each triplet: one in terms of relevance and the other in terms of diversity. For each evaluation measure, we compute the Agreement Rate (AR) of each triplet: the proportion of assessors that agree with the measure’s SERP preference. We then compare the mean ARs of the measures as well as those of best/median/worst assessors using Tukey HSD tests. Our first experiment compares traditional ranked retrieval measures based on the SERP relevance preferences: we find that normalised Discounted Cumulative Gain (nDCG) and intentwise Rank-biased Utility (iRBU) perform best in that they are the only measures that are statistically indistinguishable from our best assessor; nDCG also statistically significantly outperforms our median assessor. Our second experiment utilises 119,646 document preferences that we collected for a subset of the above topic-SERP-SERP triplets (containing 894 triplets) to compare preference-based evaluation measures as well as traditional ones. Again, we evaluate them based on the SERP relevance preferences. The results suggest that measures such as wpref5 are the most promising among the preference-based measures considered, although they underperform the best traditional measures such as nDCG on average. Our third experiment compares diversified search measures based on the SERP diversity preferences as well as the SERP relevance preferences, and it shows that D♯-measures are clearly the most reliable: in particular, D♯-nDCG and D♯-RBP statistically significantly outperform the median assessor and all intent-aware measures; they also outperform the recently proposed RBU on average. Also, in terms of agreement with SERP diversity preferences, D♯-nDCG statistically significantly outperforms RBU. Hence, if IR researchers want to use evaluation measures that align well with users’ SERP preferences, then we recommend nDCG and iRBU for traditional search, and D♯-measures such as D♯-nDCG for diversified search. As for document preference-based measures that we have examined, we do not have a strong reason to recommended them over traditional measures like nDCG, since they align slightly less well with users’ SERP preferences despite their quadratic assessment cost.","PeriodicalId":6934,"journal":{"name":"ACM Transactions on Information Systems (TOIS)","volume":"407 1","pages":"1 - 35"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems (TOIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431813","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

We examine the “goodness” of ranked retrieval evaluation measures in terms of how well they align with users’ Search Engine Result Page (SERP) preferences for web search. The SERP preferences cover 1,127 topic-SERP-SERP triplets extracted from the NTCIR-9 INTENT task, reflecting the views of 15 different assessors. Each assessor made two SERP preference judgements for each triplet: one in terms of relevance and the other in terms of diversity. For each evaluation measure, we compute the Agreement Rate (AR) of each triplet: the proportion of assessors that agree with the measure’s SERP preference. We then compare the mean ARs of the measures as well as those of best/median/worst assessors using Tukey HSD tests. Our first experiment compares traditional ranked retrieval measures based on the SERP relevance preferences: we find that normalised Discounted Cumulative Gain (nDCG) and intentwise Rank-biased Utility (iRBU) perform best in that they are the only measures that are statistically indistinguishable from our best assessor; nDCG also statistically significantly outperforms our median assessor. Our second experiment utilises 119,646 document preferences that we collected for a subset of the above topic-SERP-SERP triplets (containing 894 triplets) to compare preference-based evaluation measures as well as traditional ones. Again, we evaluate them based on the SERP relevance preferences. The results suggest that measures such as wpref5 are the most promising among the preference-based measures considered, although they underperform the best traditional measures such as nDCG on average. Our third experiment compares diversified search measures based on the SERP diversity preferences as well as the SERP relevance preferences, and it shows that D♯-measures are clearly the most reliable: in particular, D♯-nDCG and D♯-RBP statistically significantly outperform the median assessor and all intent-aware measures; they also outperform the recently proposed RBU on average. Also, in terms of agreement with SERP diversity preferences, D♯-nDCG statistically significantly outperforms RBU. Hence, if IR researchers want to use evaluation measures that align well with users’ SERP preferences, then we recommend nDCG and iRBU for traditional search, and D♯-measures such as D♯-nDCG for diversified search. As for the document preference-based measures that we have examined, we do not have a strong reason to recommend them over traditional measures like nDCG, since they align slightly less well with users’ SERP preferences despite their quadratic assessment cost.
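To make the Agreement Rate (AR) concrete, the following is a minimal illustrative sketch, not the authors' code: a measure "prefers" whichever SERP in a triplet it scores higher, and AR is the fraction of assessors whose preference judgement matches the measure's. The nDCG implementation, the tie-handling convention, and the toy relevance data are all assumptions introduced here for illustration.

```python
# Minimal sketch (assumed, not the paper's implementation) of the AR idea.
import math
from typing import Sequence


def ndcg(gains: Sequence[float], cutoff: int = 10) -> float:
    """nDCG@cutoff over a ranked list of graded gains (standard log2 discount)."""
    def dcg(gs: Sequence[float]) -> float:
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gs[:cutoff]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def agreement_rate(score_a: float, score_b: float,
                   assessor_prefs: Sequence[str]) -> float:
    """Proportion of assessors agreeing with the measure's SERP preference.

    `assessor_prefs` holds 'A' or 'B' per assessor. Ties in the measure's
    scores are treated here as agreeing with no one, which is just one
    possible convention, not necessarily the paper's.
    """
    if score_a == score_b:
        return 0.0
    measure_pref = "A" if score_a > score_b else "B"
    return sum(p == measure_pref for p in assessor_prefs) / len(assessor_prefs)


# Toy triplet: graded gains of the documents in SERP A and SERP B for one
# topic, plus hypothetical preference judgements from 15 assessors.
serp_a_gains = [3, 2, 0, 1, 0]
serp_b_gains = [2, 3, 1, 0, 0]
prefs_15_assessors = ["A"] * 9 + ["B"] * 6

ar = agreement_rate(ndcg(serp_a_gains), ndcg(serp_b_gains), prefs_15_assessors)
print(f"AR for this triplet under nDCG: {ar:.2f}")
# Mean AR over all 1,127 triplets would then be compared across measures
# (and against best/median/worst assessors) using Tukey HSD tests.
```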