{"title":"Relevance Assessments for Web Search Evaluation: Should We Randomise or Prioritise the Pooled Documents?","authors":"T. Sakai, Sijie Tao, Zhaohao Zeng","doi":"10.1145/3494833","DOIUrl":null,"url":null,"abstract":"In the context of depth-k pooling for constructing web search test collections, we compare two approaches to ordering pooled documents for relevance assessors: The prioritisation strategy (PRI) used widely at NTCIR, and the simple randomisation strategy (RND). In order to address research questions regarding PRI and RND, we have constructed and released the WWW3E8 dataset, which contains eight independent relevance labels for 32,375 topic-document pairs, i.e., a total of 259,000 labels. Four of the eight relevance labels were obtained from PRI-based pools; the other four were obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of inter-assessor agreement, system ranking agreement, and robustness to new systems that did not contribute to the pools. We also utilise an assessor activity log we obtained as a byproduct of WWW3E8 to compare the two strategies in terms of assessment efficiency. Our main findings are: (a) The presentation order has no substantial impact on assessment efficiency; (b) While the presentation order substantially affects which documents are judged (highly) relevant, the difference between the inter-assessor agreement under the PRI condition and that under the RND condition is of no practical significance; (c) Different system rankings under the PRI condition are substantially more similar to one another than those under the RND condition; and (d) PRI-based relevance assessment files (qrels) are substantially and statistically significantly more robust to new systems than RND-based ones. Finding (d) suggests that PRI helps the assessors identify relevant documents that affect the evaluation of many existing systems, including those that did not contribute to the pools. Hence, if researchers need to evaluate their current IR systems using legacy IR test collections, we recommend the use of those constructed using the PRI approach unless they have a good reason to believe that their systems retrieve relevant documents that are vastly different from the pooled documents. While this robustness of PRI may also mean that the PRI-based pools are biased against future systems that retrieve highly novel relevant documents, one should note that there is no evidence that RND is any better in this respect.","PeriodicalId":6934,"journal":{"name":"ACM Transactions on Information Systems (TOIS)","volume":"90 1","pages":"1 - 35"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems (TOIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3494833","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
In the context of depth-k pooling for constructing web search test collections, we compare two approaches to ordering pooled documents for relevance assessors: The prioritisation strategy (PRI) used widely at NTCIR, and the simple randomisation strategy (RND). In order to address research questions regarding PRI and RND, we have constructed and released the WWW3E8 dataset, which contains eight independent relevance labels for 32,375 topic-document pairs, i.e., a total of 259,000 labels. Four of the eight relevance labels were obtained from PRI-based pools; the other four were obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of inter-assessor agreement, system ranking agreement, and robustness to new systems that did not contribute to the pools. We also utilise an assessor activity log we obtained as a byproduct of WWW3E8 to compare the two strategies in terms of assessment efficiency. Our main findings are: (a) The presentation order has no substantial impact on assessment efficiency; (b) While the presentation order substantially affects which documents are judged (highly) relevant, the difference between the inter-assessor agreement under the PRI condition and that under the RND condition is of no practical significance; (c) Different system rankings under the PRI condition are substantially more similar to one another than those under the RND condition; and (d) PRI-based relevance assessment files (qrels) are substantially and statistically significantly more robust to new systems than RND-based ones. Finding (d) suggests that PRI helps the assessors identify relevant documents that affect the evaluation of many existing systems, including those that did not contribute to the pools. Hence, if researchers need to evaluate their current IR systems using legacy IR test collections, we recommend the use of those constructed using the PRI approach unless they have a good reason to believe that their systems retrieve relevant documents that are vastly different from the pooled documents. While this robustness of PRI may also mean that the PRI-based pools are biased against future systems that retrieve highly novel relevant documents, one should note that there is no evidence that RND is any better in this respect.
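The abstract contrasts two ways of ordering depth-k pooled documents for relevance assessors. The sketch below illustrates the two conditions: it builds a depth-k pool from a set of ranked runs and then orders it either randomly (RND) or by a simple pseudo-relevance score (a PRI-like ordering). The scoring heuristic used here (sum of k minus rank over the runs that retrieved a document) and the function names are illustrative assumptions only and may not match the exact NTCIR prioritisation procedure.

```python
import random
from collections import defaultdict

def depth_k_pool(runs, k):
    """Collect the union of the top-k documents from each run (depth-k pooling)."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool

def rnd_order(pool, seed=0):
    """RND: present the pooled documents to the assessor in random order."""
    docs = sorted(pool)                 # fix iteration order before shuffling
    random.Random(seed).shuffle(docs)
    return docs

def pri_order(pool, runs, k):
    """PRI-like ordering (illustrative): rank pooled documents by a pseudo-relevance
    score, here the sum of (k - rank) over all runs that retrieved the document.
    The actual NTCIR prioritisation may differ in detail."""
    score = defaultdict(int)
    for ranking in runs:
        for rank, doc in enumerate(ranking[:k]):
            score[doc] += k - rank
    return sorted(pool, key=lambda d: (-score[d], d))

# Toy example: three runs, depth-2 pooling.
runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d5", "d2", "d1"]]
pool = depth_k_pool(runs, k=2)
print(pri_order(pool, runs, k=2))  # d2 comes first: retrieved near the top by all runs
print(rnd_order(pool))             # same pool, presented in a shuffled order
```

Under PRI-like ordering, documents retrieved highly by many runs are judged first, which is consistent with finding (d): such documents tend to affect the evaluation of many systems, so judging them reliably makes the resulting qrels more robust.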