{"title":"Reproducibility Challenges in Information Retrieval Evaluation","authors":"N. Ferro","doi":"10.1145/3020206","DOIUrl":null,"url":null,"abstract":"Information Retrieval (IR) is concerned with ranking information resources with respect to user information needs, delivering a wide range of key applications for industry and society, such as Web search engines [Croft et al. 2009], intellectual property, and patent search [Lupu and Hanbury 2013], and many others. The performance of IR systems is determined not only by their efficiency but also and most importantly by their effectiveness, that is, their ability to retrieve and better rank relevant information resources while at the same time suppressing the retrieval of not relevant ones. Due to the many sources of uncertainty, as for example vague user information needs, unstructured information sources, or subjective notion of relevance, experimental evaluation is the only mean to assess the performances of IR systems from the effectiveness point of view. Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections, consisting of documents, sampled from a real domain of interest; topics, representing real user information needs in that domain; and relevance judgements, determining which documents are relevant to which topics [Harman 2011]. To share the effort and optimize the use of resources, experimental evaluation is usually carried out in publicly open and large-scale evaluation campaigns at the international level, like the Text REtrieval Conference (TREC)1 in the United States [Harman and Voorhees 2005], the Conference and Labs of the Evaluation Forum (CLEF)2 in Europe [Ferro 2014], the NII Testbeds and Community for Information access Research (NTCIR)3 in Japan and Asia, and the Forum for Information Retrieval Evaluation (FIRE)4 in India. These initiatives produce, every year, huge amounts of scientific data","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"52 1","pages":"1 - 4"},"PeriodicalIF":0.0000,"publicationDate":"2017-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 49
Abstract
Information Retrieval (IR) is concerned with ranking information resources with respect to user information needs, delivering a wide range of key applications for industry and society, such as Web search engines [Croft et al. 2009], intellectual property and patent search [Lupu and Hanbury 2013], and many others. The performance of IR systems is determined not only by their efficiency but also, and most importantly, by their effectiveness, that is, their ability to retrieve and better rank relevant information resources while suppressing the retrieval of non-relevant ones. Due to the many sources of uncertainty, such as vague user information needs, unstructured information sources, or the subjective notion of relevance, experimental evaluation is the only means to assess the performance of IR systems from the effectiveness point of view. Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections consisting of documents, sampled from a real domain of interest; topics, representing real user information needs in that domain; and relevance judgements, determining which documents are relevant to which topics [Harman 2011]. To share the effort and optimize the use of resources, experimental evaluation is usually carried out in publicly open, large-scale evaluation campaigns at the international level, such as the Text REtrieval Conference (TREC) in the United States [Harman and Voorhees 2005], the Conference and Labs of the Evaluation Forum (CLEF) in Europe [Ferro 2014], the NII Testbeds and Community for Information access Research (NTCIR) in Japan and Asia, and the Forum for Information Retrieval Evaluation (FIRE) in India. These initiatives produce, every year, huge amounts of scientific data.
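To make the Cranfield-style setup concrete, the sketch below shows how a system's ranked "run" is scored against relevance judgements (qrels) using Average Precision as an example effectiveness metric. This is only an illustration of the evaluation paradigm described above, not code from the paper; the topic and document identifiers and the toy data are invented for the example.

```python
# Illustrative sketch of Cranfield-style effectiveness evaluation.
# A "run" maps each topic to a ranked list of document ids produced by an IR system;
# "qrels" map each topic to the set of documents judged relevant for it.

def average_precision(ranked_docs, relevant_docs):
    """AP for one topic: mean of the precision values at the ranks of relevant documents."""
    if not relevant_docs:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs)

def mean_average_precision(run, qrels):
    """MAP over all topics: arithmetic mean of the per-topic AP scores."""
    scores = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data: identifiers are hypothetical, for illustration only.
qrels = {"topic-1": {"doc-3", "doc-7"}, "topic-2": {"doc-1"}}
run = {"topic-1": ["doc-3", "doc-5", "doc-7"], "topic-2": ["doc-4", "doc-1"]}
print(mean_average_precision(run, qrels))  # AP: 0.8333 (topic-1), 0.5 (topic-2) -> MAP ~ 0.6667
```

Evaluation campaigns such as TREC or CLEF follow the same scheme at scale: participants submit runs over a shared document collection and topic set, and the organizers score them against pooled relevance judgements with metrics of this kind.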