Boiling down information retrieval test collections

T. Sakai, T. Mitamura
{"title":"简化信息检索测试集合","authors":"T. Sakai, T. Mitamura","doi":"10.5555/1937055.1937066","DOIUrl":null,"url":null,"abstract":"Constructing large-scale test collections is costly and time-consuming, and a few relevance assessment methods have been proposed for constructing \"minimal\" information retrieval test collections that may still provide reliable experimental results. In contrast to building up such test collections, we take existing test collections constructed through the traditional pooling approach and empirically investigate whether they can be \"boiled down.\" More specifically, we report on experiments with test collections from both NT-CIR and TREC to investigate the effect of reducing both the topic set size and the pool depth on the outcome of a statistical significance test between two systems, starting with (approximately) 100 topics and depth-100 pools. We define cost (of manual relevance assessment) as the pool depth multiplied by the topic set size, and error as a system pair whose outcome of statistical significance testing differs from the original result based on the full test collection. Our main findings are: (a) Cost and the number of errors are negatively correlated, and any attempt at substantially reducing cost introduces some errors; (b) The NTCIR-7 IR4QA and the TREC 2004 robust track test collections all yield a comparable and considerable number of errors in response to cost reduction, and this is true despite the fact that the TREC relevance assessments relied on more than twice as many runs as the NTCIR ones; (c) Using 100 topics with depth-30 pools generally yields fewer errors than using 30 topics with depth-100 pools; and (d) Even with depth-100 pools, using fewer than 100 topics results in false alarms, i.e. two systems are declared significantly different even though the full topic set would declare otherwise.","PeriodicalId":120472,"journal":{"name":"RIAO Conference","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Boiling down information retrieval test collections\",\"authors\":\"T. Sakai, T. Mitamura\",\"doi\":\"10.5555/1937055.1937066\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Constructing large-scale test collections is costly and time-consuming, and a few relevance assessment methods have been proposed for constructing \\\"minimal\\\" information retrieval test collections that may still provide reliable experimental results. In contrast to building up such test collections, we take existing test collections constructed through the traditional pooling approach and empirically investigate whether they can be \\\"boiled down.\\\" More specifically, we report on experiments with test collections from both NT-CIR and TREC to investigate the effect of reducing both the topic set size and the pool depth on the outcome of a statistical significance test between two systems, starting with (approximately) 100 topics and depth-100 pools. We define cost (of manual relevance assessment) as the pool depth multiplied by the topic set size, and error as a system pair whose outcome of statistical significance testing differs from the original result based on the full test collection. 
Our main findings are: (a) Cost and the number of errors are negatively correlated, and any attempt at substantially reducing cost introduces some errors; (b) The NTCIR-7 IR4QA and the TREC 2004 robust track test collections all yield a comparable and considerable number of errors in response to cost reduction, and this is true despite the fact that the TREC relevance assessments relied on more than twice as many runs as the NTCIR ones; (c) Using 100 topics with depth-30 pools generally yields fewer errors than using 30 topics with depth-100 pools; and (d) Even with depth-100 pools, using fewer than 100 topics results in false alarms, i.e. two systems are declared significantly different even though the full topic set would declare otherwise.\",\"PeriodicalId\":120472,\"journal\":{\"name\":\"RIAO Conference\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"RIAO Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5555/1937055.1937066\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"RIAO Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5555/1937055.1937066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Constructing large-scale test collections is costly and time-consuming, and a few relevance assessment methods have been proposed for constructing "minimal" information retrieval test collections that may still provide reliable experimental results. In contrast to building up such test collections, we take existing test collections constructed through the traditional pooling approach and empirically investigate whether they can be "boiled down." More specifically, we report on experiments with test collections from both NTCIR and TREC to investigate the effect of reducing both the topic set size and the pool depth on the outcome of a statistical significance test between two systems, starting with (approximately) 100 topics and depth-100 pools. We define cost (of manual relevance assessment) as the pool depth multiplied by the topic set size, and error as a system pair for which the outcome of statistical significance testing differs from the original result based on the full test collection. Our main findings are: (a) cost and the number of errors are negatively correlated, and any attempt at substantially reducing cost introduces some errors; (b) the NTCIR-7 IR4QA and the TREC 2004 robust track test collections all yield a comparable and considerable number of errors in response to cost reduction, even though the TREC relevance assessments relied on more than twice as many runs as the NTCIR ones; (c) using 100 topics with depth-30 pools generally yields fewer errors than using 30 topics with depth-100 pools; and (d) even with depth-100 pools, using fewer than 100 topics results in false alarms, i.e., two systems are declared significantly different even though the full topic set would declare otherwise.
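The two quantities being reduced are easy to make concrete. Under the traditional pooling approach, the relevance judgments for a topic cover the union of the top-k documents retrieved by every submitted run, so the pool depth k directly controls how many documents must be judged. A minimal sketch of depth-k pooling, with hypothetical run names and document IDs:

```python
# Depth-k pooling: for each topic, pool the union of the top-k documents
# from every submitted run; only pooled documents receive relevance
# judgments. Run names and document IDs below are hypothetical.

def build_pool(runs: dict[str, list[str]], depth: int) -> set[str]:
    """runs maps a run name to its ranked document list for one topic."""
    pool: set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

runs = {
    "runA": ["d3", "d7", "d1", "d9", "d2"],
    "runB": ["d7", "d4", "d3", "d8", "d5"],
}
print(build_pool(runs, depth=3))  # {'d1', 'd3', 'd4', 'd7'} (set order varies)
```

The cost/error trade-off can be illustrated the same way. The sketch below is an illustration under simplifying assumptions, not the authors' exact protocol: it draws random topic subsets from a full collection, re-runs a paired significance test between two systems, and counts disagreements with the full-collection verdict. The per-topic scores are synthetic stand-ins for real effectiveness scores (e.g., average precision computed from reduced-depth pools).

```python
# Boiling-down illustration: does a significance verdict reached on the
# full topic set survive when the topic set is reduced? Scores are synthetic.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
NUM_TOPICS, DEPTH, ALPHA = 100, 100, 0.05

# Hypothetical per-topic effectiveness scores for two systems.
sys_a = rng.uniform(0.2, 0.8, NUM_TOPICS)
sys_b = np.clip(sys_a + rng.normal(0.02, 0.06, NUM_TOPICS), 0.0, 1.0)

def significant(a: np.ndarray, b: np.ndarray) -> bool:
    """Two-sided paired t-test on per-topic scores."""
    return ttest_rel(a, b).pvalue < ALPHA

full_verdict = significant(sys_a, sys_b)  # outcome on the full collection

for n in (70, 50, 30):
    trials, flips = 1000, 0
    for _ in range(trials):
        subset = rng.choice(NUM_TOPICS, size=n, replace=False)
        if significant(sys_a[subset], sys_b[subset]) != full_verdict:
            flips += 1  # an "error": verdict disagrees with the original
    print(f"{n} topics, depth {DEPTH} -> cost {n * DEPTH}: "
          f"{flips / trials:.1%} of trials flip the verdict")
```

In the paper's terms, each flipped verdict for a system pair is one error, and the cost of the reduced design is the topic set size multiplied by the pool depth; finding (d) corresponds to flips in which the reduced topic set declares a significant difference that the full topic set does not.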