Estimating Measurement Uncertainty for Information Retrieval Effectiveness Metrics

Alistair Moffat, Falk Scholer, Ziying Yang
{"title":"Estimating Measurement Uncertainty for Information Retrieval Effectiveness Metrics","authors":"Alistair Moffat, Falk Scholer, Ziying Yang","doi":"10.1145/3239572","DOIUrl":null,"url":null,"abstract":"One typical way of building test collections for offline measurement of information retrieval systems is to pool the ranked outputs of different systems down to some chosen depth d and then form relevance judgments for those documents only. Non-pooled documents—ones that did not appear in the top-d sets of any of the contributing systems—are then deemed to be non-relevant for the purposes of evaluating the relative behavior of the systems. In this article, we use RBP-derived residuals to re-examine the reliability of that process. By fitting the RBP parameter φ to maximize similarity between AP- and NDCG-induced system rankings, on the one hand, and RBP-induced rankings, on the other, an estimate can be made as to the potential score uncertainty associated with those two recall-based metrics. We then consider the effect that residual size—as an indicator of possible measurement uncertainty in utility-based metrics—has in connection with recall-based metrics by computing the effect of increasing pool sizes and examining the trends that arise in terms of both metric score and system separability using standard statistical tests. The experimental results show that the confidence levels expressed via the p-values generated by statistical tests are only weakly connected to the size of the residual and to the degree of measurement uncertainty caused by the presence of unjudged documents. Statistical confidence estimates are, however, largely consistent as pooling depths are altered. 
We therefore recommend that all such experimental results should report, in addition to the outcomes of statistical significance tests, the residual measurements generated by a suitably matched weighted-precision metric, to give a clear indication of measurement uncertainty that arises due to the presence of unjudged documents in test collections with finite pooled judgments.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"200 1","pages":"1 - 22"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3239572","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

One typical way of building test collections for offline measurement of information retrieval systems is to pool the ranked outputs of different systems down to some chosen depth d and then form relevance judgments for those documents only. Non-pooled documents—ones that did not appear in the top-d sets of any of the contributing systems—are then deemed to be non-relevant for the purposes of evaluating the relative behavior of the systems. In this article, we use RBP-derived residuals to re-examine the reliability of that process. By fitting the RBP parameter φ to maximize similarity between AP- and NDCG-induced system rankings, on the one hand, and RBP-induced rankings, on the other, an estimate can be made as to the potential score uncertainty associated with those two recall-based metrics. We then consider the effect that residual size—as an indicator of possible measurement uncertainty in utility-based metrics—has in connection with recall-based metrics by computing the effect of increasing pool sizes and examining the trends that arise in terms of both metric score and system separability using standard statistical tests. The experimental results show that the confidence levels expressed via the p-values generated by statistical tests are only weakly connected to the size of the residual and to the degree of measurement uncertainty caused by the presence of unjudged documents. Statistical confidence estimates are, however, largely consistent as pooling depths are altered. We therefore recommend that all such experimental results should report, in addition to the outcomes of statistical significance tests, the residual measurements generated by a suitably matched weighted-precision metric, to give a clear indication of measurement uncertainty that arises due to the presence of unjudged documents in test collections with finite pooled judgments.
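The abstract's central quantity is the RBP residual: because RBP (rank-biased precision) weights rank i by (1-φ)·φ^(i-1), the unjudged documents in a ranking bound the score between a minimum (all unjudged assumed non-relevant) and a maximum (all assumed relevant), and the gap between those bounds is the residual. A minimal sketch of that computation follows; the function name and interface are illustrative assumptions, not code from the paper.

```python
def rbp_with_residual(judgments, phi=0.8):
    """Compute the RBP lower-bound score and residual for one ranking.

    judgments: relevance of the documents in rank order, each entry
               1 (relevant), 0 (non-relevant), or None (unjudged).
    phi:       the RBP persistence parameter.
    Returns (base_score, residual), where base_score treats unjudged
    documents as non-relevant and residual is the total weight mass
    attributable to unjudged documents (including the unevaluated tail).
    """
    base = 0.0      # lower bound on the RBP score
    residual = 0.0  # score uncertainty due to unjudged documents
    weight = 1.0 - phi  # weight of rank 1: (1 - phi) * phi^0
    for rel in judgments:
        if rel is None:
            residual += weight  # unjudged: could add up to `weight`
        else:
            base += weight * rel
        weight *= phi
    # Everything below the evaluated depth d is also unjudged, so the
    # geometric tail sum_{i>d} (1-phi)*phi^(i-1) = phi^d joins the residual.
    residual += phi ** len(judgments)
    return base, residual
```

For example, with φ = 0.5 and judgments [1, None, 0], the lower bound is 0.5 and the residual is 0.25 (rank 2) + 0.125 (tail) = 0.375; the paper's argument is that this residual, not the p-value alone, signals how much the unjudged documents could move the reported score.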