On the measurement of test collection reliability

Julián Urbano, M. Marrero, Diego Martín
{"title":"On the measurement of test collection reliability","authors":"Julián Urbano, M. Marrero, Diego Martín","doi":"10.1145/2484028.2484038","DOIUrl":null,"url":null,"abstract":"The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what if scenarios, and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative founded on analysis of variance that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice, because they do not correspond to well known indicators like Kendall tau correlation. We empirically established these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators, and show that they are extremely dependent on the sample of systems and queries used, so much that the required number of queries to achieve a certain level of reliability can vary in orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484028.2484038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 42

Abstract

The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what if scenarios, and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative founded on analysis of variance that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice, because they do not correspond to well known indicators like Kendall tau correlation. We empirically established these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators, and show that they are extremely dependent on the sample of systems and queries used, so much that the required number of queries to achieve a certain level of reliability can vary in orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
关于测试采集可靠性的测量
测试集合的可靠性与它包含的查询数量成正比。但是建立一个包含许多查询的集合是昂贵的,因此研究人员必须在可靠性和成本之间找到平衡。之前关于测试收集可靠性测量的工作依赖于基于数据的方法,这些方法考虑了随机的假设情景,并提供了诸如互换利率和肯德尔tau相关性等指标。概括性理论是建立在方差分析基础上的一种替代理论,它提供了基于统计理论的可靠性指标。然而,这些可靠性指标在实践中很难解释,因为它们不对应于众所周知的指标,如肯德尔tau相关。我们根据40多个TREC收集的数据建立了这些关系,从而填补了概括性理论在实际解释中的空白。我们还回顾了这些指标的计算,并表明它们非常依赖于所使用的系统和查询的样本,以至于达到一定程度的可靠性所需的查询数量可能会发生数量级的变化。我们讨论了这些统计的置信区间的计算,提供了一个更可靠的工具来测量测试集合的可靠性。考虑到所有这些结果,我们回顾了大量的TREC测试集合,认为它们可能不像普遍接受的那样可靠,而且通常选择的50个查询甚至不足以实现稳定的排名。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信