{"title":"一个组件级比较IR评价工具","authors":"Thomas Wilhelm, Jens Kürsten, Maximilian Eibl","doi":"10.1145/2009916.2010165","DOIUrl":null,"url":null,"abstract":"1. MOTIVATION Experimental information retrieval (IR) evaluation is an important instrument to measure the effectiveness of novel methods. Although IR system complexity has grown over years, the general framework for evaluation remained unchanged since its first implementation in the 1960s. Test collections were growing from thousands to millions of documents. Regular reuse resulted in larger topic sets for evaluation. New business models for information access required novel interpretations of effectiveness measures. Nevertheless, most experimental evaluations still rely on an over 50 year old paradigm. Participants of a SIGIR workshop in 2009 [1] discussed the implementation of new methodological standards for evaluation. But at the same time they worried about practicable ways to implement them. A review about recent publications containing experimental evaluations supports this concern [2]. The study also presented a web-based platform for longitudinal evaluation. In a similar way, data from the past decade of CLEF evaluations have been released through the DIRECT system. While the operators of the latter system reported about 50 new users since the release of the data [3], no further contributions were recorded on the web-platform introduced in [2]. In our point of view archiving evaluation data for longitudinal analysis is a first important step. A next step is to develop a methodology that supports researchers in choosing appropriate baselines for comparison. This can be achieved by reporting evaluation results on component level [4] rather than on system level. An exemplary study was presented in [2], where the Indri system was tested with several components switched on or off. Following this idea, an approach to assess novel methods could be to compare to related components only. This would require the community to formally record details of system configurations in connection with experimental results. We suppose that transparent descriptions of system components used in experiments could help researchers in choosing appropriate baselines.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"131 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"A tool for comparative IR evaluation on component level\",\"authors\":\"Thomas Wilhelm, Jens Kürsten, Maximilian Eibl\",\"doi\":\"10.1145/2009916.2010165\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"1. MOTIVATION Experimental information retrieval (IR) evaluation is an important instrument to measure the effectiveness of novel methods. Although IR system complexity has grown over years, the general framework for evaluation remained unchanged since its first implementation in the 1960s. Test collections were growing from thousands to millions of documents. Regular reuse resulted in larger topic sets for evaluation. New business models for information access required novel interpretations of effectiveness measures. Nevertheless, most experimental evaluations still rely on an over 50 year old paradigm. Participants of a SIGIR workshop in 2009 [1] discussed the implementation of new methodological standards for evaluation. 
But at the same time they worried about practicable ways to implement them. A review about recent publications containing experimental evaluations supports this concern [2]. The study also presented a web-based platform for longitudinal evaluation. In a similar way, data from the past decade of CLEF evaluations have been released through the DIRECT system. While the operators of the latter system reported about 50 new users since the release of the data [3], no further contributions were recorded on the web-platform introduced in [2]. In our point of view archiving evaluation data for longitudinal analysis is a first important step. A next step is to develop a methodology that supports researchers in choosing appropriate baselines for comparison. This can be achieved by reporting evaluation results on component level [4] rather than on system level. An exemplary study was presented in [2], where the Indri system was tested with several components switched on or off. Following this idea, an approach to assess novel methods could be to compare to related components only. This would require the community to formally record details of system configurations in connection with experimental results. We suppose that transparent descriptions of system components used in experiments could help researchers in choosing appropriate baselines.\",\"PeriodicalId\":356580,\"journal\":{\"name\":\"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval\",\"volume\":\"131 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-07-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2009916.2010165\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010165","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A tool for comparative IR evaluation on component level
1. MOTIVATION

Experimental information retrieval (IR) evaluation is an important instrument for measuring the effectiveness of novel methods. Although the complexity of IR systems has grown over the years, the general framework for evaluation has remained unchanged since its first implementation in the 1960s. Test collections have grown from thousands to millions of documents, regular reuse has produced larger topic sets for evaluation, and new business models for information access have required novel interpretations of effectiveness measures. Nevertheless, most experimental evaluations still rely on a paradigm that is more than 50 years old.

Participants of a SIGIR workshop in 2009 [1] discussed the implementation of new methodological standards for evaluation, but at the same time they worried about practicable ways to implement them. A review of recent publications containing experimental evaluations supports this concern [2]. That study also presented a web-based platform for longitudinal evaluation. In a similar way, data from the past decade of CLEF evaluations have been released through the DIRECT system. While the operators of the latter system reported about 50 new users since the release of the data [3], no further contributions were recorded on the web platform introduced in [2].

In our view, archiving evaluation data for longitudinal analysis is an important first step. A next step is to develop a methodology that supports researchers in choosing appropriate baselines for comparison. This can be achieved by reporting evaluation results at the component level [4] rather than at the system level. An exemplary study was presented in [2], in which the Indri system was tested with individual components switched on or off. Following this idea, novel methods could be assessed by comparing them against related components only. This would require the community to formally record details of system configurations together with experimental results. We believe that transparent descriptions of the system components used in experiments could help researchers choose appropriate baselines.
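To make the idea of component-level recording concrete, the following sketch illustrates one possible way to describe a run by its components and to retrieve archived runs that differ in only a single component as candidate baselines. It is our illustration under stated assumptions, not the tool or data model described in this paper; all names (Component, Experiment, find_baselines) and the example values are hypothetical.

# Minimal sketch (Python 3.9+), assuming a flat component description per run.
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A single retrieval-system component, e.g. a stemmer or ranking model."""
    role: str   # e.g. "stemmer", "ranking", "query_expansion"
    name: str   # e.g. "Porter", "BM25", "none"


@dataclass
class Experiment:
    """One experimental run: its component configuration and a headline measure."""
    run_id: str
    components: frozenset[Component]
    mean_average_precision: float


def find_baselines(new_run: Experiment, archive: list[Experiment],
                   varied_role: str) -> list[Experiment]:
    """Return archived runs that differ from `new_run` only in the given
    component role, i.e. candidate baselines for a component-level comparison."""
    fixed = {c for c in new_run.components if c.role != varied_role}
    return [run for run in archive
            if {c for c in run.components if c.role != varied_role} == fixed
            and run.run_id != new_run.run_id]


if __name__ == "__main__":
    archive = [
        Experiment("run-001",
                   frozenset({Component("stemmer", "Porter"),
                              Component("ranking", "BM25")}),
                   0.231),
        Experiment("run-002",
                   frozenset({Component("stemmer", "none"),
                              Component("ranking", "BM25")}),
                   0.214),
    ]
    new_run = Experiment("run-003",
                         frozenset({Component("stemmer", "Krovetz"),
                                    Component("ranking", "BM25")}),
                         0.238)
    # Compare the new stemmer only against runs whose other components match.
    for baseline in find_baselines(new_run, archive, varied_role="stemmer"):
        print(baseline.run_id, baseline.mean_average_precision)

In this sketch, run-003 would be compared against run-001 and run-002, since they share the same ranking component and vary only the stemmer; any richer configuration description (parameter settings, component versions) would extend the Component record in the same spirit.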