Quantifying test collection quality based on the consistency of relevance judgements

Falk Scholer, A. Turpin, M. Sanderson
{"title":"Quantifying test collection quality based on the consistency of relevance judgements","authors":"Falk Scholer, A. Turpin, M. Sanderson","doi":"10.1145/2009916.2010057","DOIUrl":null,"url":null,"abstract":"Relevance assessments are a key component for test collection-based evaluation of information retrieval systems. This paper reports on a feature of such collections that is used as a form of ground truth data to allow analysis of human assessment error. A wide range of test collections are retrospectively examined to determine how accurately assessors judge the relevance of documents. Our results demonstrate a high level of inconsistency across the collections studied. The level of irregularity is shown to vary across topics, with some showing a very high level of assessment error. We investigate possible influences on the error, and demonstrate that inconsistency in judging increases with time. While the level of detail in a topic specification does not appear to influence the errors that assessors make, judgements are significantly affected by the decisions made on previously seen similar documents. Assessors also display an assessment inertia. Alternate approaches to generating relevance judgements appear to reduce errors. A further investigation of the way that retrieval systems are ranked using sets of relevance judgements produced early and late in the judgement process reveals a consistent influence measured across the majority of examined test collections. We conclude that there is a clear value in examining, even inserting, ground truth data in test collections, and propose ways to help minimise the sources of inconsistency when creating future test collections.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"81","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010057","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 81

Abstract

Relevance assessments are a key component for test collection-based evaluation of information retrieval systems. This paper reports on a feature of such collections that is used as a form of ground truth data to allow analysis of human assessment error. A wide range of test collections are retrospectively examined to determine how accurately assessors judge the relevance of documents. Our results demonstrate a high level of inconsistency across the collections studied. The level of irregularity is shown to vary across topics, with some showing a very high level of assessment error. We investigate possible influences on the error, and demonstrate that inconsistency in judging increases with time. While the level of detail in a topic specification does not appear to influence the errors that assessors make, judgements are significantly affected by the decisions made on previously seen similar documents. Assessors also display an assessment inertia. Alternate approaches to generating relevance judgements appear to reduce errors. A further investigation of the way that retrieval systems are ranked using sets of relevance judgements produced early and late in the judgement process reveals a consistent influence measured across the majority of examined test collections. We conclude that there is a clear value in examining, even inserting, ground truth data in test collections, and propose ways to help minimise the sources of inconsistency when creating future test collections.
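The abstract describes quantifying assessor consistency from judgements made on documents that serve as a form of ground truth; the exact methodology is detailed in the full paper. As a rough illustration only, the sketch below shows one way to compute a per-topic agreement rate over pairs of judgements on duplicate documents. The function name `consistency_by_topic`, the record layout, and the sample data are hypothetical and are not drawn from the paper.

```python
# Illustrative sketch (not the paper's exact methodology): estimate assessor
# consistency per topic from pairs of judgements made on duplicate documents.
# Each record is assumed to be (topic_id, first_judgement, second_judgement),
# where judgements are binary relevance labels.

from collections import defaultdict

def consistency_by_topic(duplicate_judgements):
    """Return the fraction of duplicate-document pairs judged identically, per topic."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for topic_id, first, second in duplicate_judgements:
        total[topic_id] += 1
        if first == second:
            agree[topic_id] += 1
    return {topic: agree[topic] / total[topic] for topic in total}

if __name__ == "__main__":
    # Hypothetical data: (topic, judgement on a document, judgement on its duplicate)
    sample = [
        (301, 1, 1), (301, 1, 0), (301, 0, 0),
        (302, 1, 1), (302, 0, 0),
    ]
    for topic, rate in sorted(consistency_by_topic(sample).items()):
        print(f"Topic {topic}: {rate:.2f} of duplicate pairs judged consistently")
```

A lower agreement rate for a topic would correspond to the higher levels of assessment error that the paper reports varying across topics.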