{"title":"一个组件级比较IR评价工具","authors":"Thomas Wilhelm, Jens Kürsten, Maximilian Eibl","doi":"10.1145/2009916.2010165","DOIUrl":null,"url":null,"abstract":"1. MOTIVATION Experimental information retrieval (IR) evaluation is an important instrument to measure the effectiveness of novel methods. Although IR system complexity has grown over years, the general framework for evaluation remained unchanged since its first implementation in the 1960s. Test collections were growing from thousands to millions of documents. Regular reuse resulted in larger topic sets for evaluation. New business models for information access required novel interpretations of effectiveness measures. Nevertheless, most experimental evaluations still rely on an over 50 year old paradigm. Participants of a SIGIR workshop in 2009 [1] discussed the implementation of new methodological standards for evaluation. But at the same time they worried about practicable ways to implement them. A review about recent publications containing experimental evaluations supports this concern [2]. The study also presented a web-based platform for longitudinal evaluation. In a similar way, data from the past decade of CLEF evaluations have been released through the DIRECT system. While the operators of the latter system reported about 50 new users since the release of the data [3], no further contributions were recorded on the web-platform introduced in [2]. In our point of view archiving evaluation data for longitudinal analysis is a first important step. A next step is to develop a methodology that supports researchers in choosing appropriate baselines for comparison. This can be achieved by reporting evaluation results on component level [4] rather than on system level. An exemplary study was presented in [2], where the Indri system was tested with several components switched on or off. Following this idea, an approach to assess novel methods could be to compare to related components only. This would require the community to formally record details of system configurations in connection with experimental results. We suppose that transparent descriptions of system components used in experiments could help researchers in choosing appropriate baselines.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"131 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"A tool for comparative IR evaluation on component level\",\"authors\":\"Thomas Wilhelm, Jens Kürsten, Maximilian Eibl\",\"doi\":\"10.1145/2009916.2010165\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"1. MOTIVATION Experimental information retrieval (IR) evaluation is an important instrument to measure the effectiveness of novel methods. Although IR system complexity has grown over years, the general framework for evaluation remained unchanged since its first implementation in the 1960s. Test collections were growing from thousands to millions of documents. Regular reuse resulted in larger topic sets for evaluation. New business models for information access required novel interpretations of effectiveness measures. Nevertheless, most experimental evaluations still rely on an over 50 year old paradigm. Participants of a SIGIR workshop in 2009 [1] discussed the implementation of new methodological standards for evaluation. 
But at the same time they worried about practicable ways to implement them. A review about recent publications containing experimental evaluations supports this concern [2]. The study also presented a web-based platform for longitudinal evaluation. In a similar way, data from the past decade of CLEF evaluations have been released through the DIRECT system. While the operators of the latter system reported about 50 new users since the release of the data [3], no further contributions were recorded on the web-platform introduced in [2]. In our point of view archiving evaluation data for longitudinal analysis is a first important step. A next step is to develop a methodology that supports researchers in choosing appropriate baselines for comparison. This can be achieved by reporting evaluation results on component level [4] rather than on system level. An exemplary study was presented in [2], where the Indri system was tested with several components switched on or off. Following this idea, an approach to assess novel methods could be to compare to related components only. This would require the community to formally record details of system configurations in connection with experimental results. We suppose that transparent descriptions of system components used in experiments could help researchers in choosing appropriate baselines.\",\"PeriodicalId\":356580,\"journal\":{\"name\":\"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval\",\"volume\":\"131 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-07-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2009916.2010165\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010165","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A tool for comparative IR evaluation on component level
1. MOTIVATION

Experimental information retrieval (IR) evaluation is an important instrument for measuring the effectiveness of novel methods. Although the complexity of IR systems has grown over the years, the general framework for evaluation has remained unchanged since its first implementation in the 1960s. Test collections have grown from thousands to millions of documents, regular reuse has produced larger topic sets for evaluation, and new business models for information access have required novel interpretations of effectiveness measures. Nevertheless, most experimental evaluations still rely on a paradigm that is more than 50 years old.

Participants of a SIGIR workshop in 2009 [1] discussed the implementation of new methodological standards for evaluation, but at the same time they worried about practicable ways to implement them. A review of recent publications containing experimental evaluations supports this concern [2]. That study also presented a web-based platform for longitudinal evaluation. In a similar way, data from the past decade of CLEF evaluations have been released through the DIRECT system. While the operators of the latter system reported about 50 new users since the release of the data [3], no further contributions were recorded on the web platform introduced in [2].

In our view, archiving evaluation data for longitudinal analysis is an important first step. A next step is to develop a methodology that supports researchers in choosing appropriate baselines for comparison. This can be achieved by reporting evaluation results at the component level [4] rather than at the system level. An exemplary study was presented in [2], in which the Indri system was tested with individual components switched on or off. Following this idea, novel methods could be assessed by comparing them against related components only. This would require the community to formally record details of system configurations together with experimental results. We believe that transparent descriptions of the system components used in experiments could help researchers choose appropriate baselines.
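To make the idea of component-level recording concrete, the following sketch illustrates one possible way to describe a run by its components and to retrieve archived runs that differ in only a single component as candidate baselines. It is our illustration under stated assumptions, not the tool or data model described in this paper; all names (Component, Experiment, find_baselines) and the example values are hypothetical.

# Minimal sketch (Python 3.9+), assuming a flat component description per run.
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A single retrieval-system component, e.g. a stemmer or ranking model."""
    role: str   # e.g. "stemmer", "ranking", "query_expansion"
    name: str   # e.g. "Porter", "BM25", "none"


@dataclass
class Experiment:
    """One experimental run: its component configuration and a headline measure."""
    run_id: str
    components: frozenset[Component]
    mean_average_precision: float


def find_baselines(new_run: Experiment, archive: list[Experiment],
                   varied_role: str) -> list[Experiment]:
    """Return archived runs that differ from `new_run` only in the given
    component role, i.e. candidate baselines for a component-level comparison."""
    fixed = {c for c in new_run.components if c.role != varied_role}
    return [run for run in archive
            if {c for c in run.components if c.role != varied_role} == fixed
            and run.run_id != new_run.run_id]


if __name__ == "__main__":
    archive = [
        Experiment("run-001",
                   frozenset({Component("stemmer", "Porter"),
                              Component("ranking", "BM25")}),
                   0.231),
        Experiment("run-002",
                   frozenset({Component("stemmer", "none"),
                              Component("ranking", "BM25")}),
                   0.214),
    ]
    new_run = Experiment("run-003",
                         frozenset({Component("stemmer", "Krovetz"),
                                    Component("ranking", "BM25")}),
                         0.238)
    # Compare the new stemmer only against runs whose other components match.
    for baseline in find_baselines(new_run, archive, varied_role="stemmer"):
        print(baseline.run_id, baseline.mean_average_precision)

In this sketch, run-003 would be compared against run-001 and run-002, since they share the same ranking component and vary only the stemmer; any richer configuration description (parameter settings, component versions) would extend the Component record in the same spirit.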