Models and metrics: IR evaluation as a user process

Australasian Document Computing Symposium. Pub Date: 2012-12-05. DOI: 10.1145/2407085.2407092

Alistair Moffat, Falk Scholer, Paul Thomas


Citations: 37

Abstract

Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behavior of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgments. The former has the benefit of directly assessing the actual goal of the system, namely the user's ability to complete a search task; whereas the latter approach has the benefit of being quantitative and repeatable. Each given effectiveness metric is an attempt to bridge the gap between these two evaluation approaches, since the implicit belief supporting the use of any particular metric is that user task performance should be correlated with the numeric score provided by the metric. In this work we explore that linkage, considering a range of effectiveness metrics, and the user search behavior that each of them implies. We then examine more complex user models, as a guide to the development of new effectiveness metrics. We conclude by summarizing an experiment that we believe will help establish the strength of the linkage between models and metrics.
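As a rough illustration of how a numeric metric can imply a model of user behavior, the sketch below scores one ranked list with two widely used metrics. The abstract does not name these metrics; precision at depth k and rank-biased precision are chosen here as assumptions for illustration, and the function names and the example relevance vector are hypothetical.

```python
from typing import Sequence


def precision_at_k(rels: Sequence[float], k: int) -> float:
    """Implied user: inspects exactly the top k results, then stops.

    Every inspected result carries equal weight 1/k.
    """
    return sum(rels[:k]) / k


def rank_biased_precision(rels: Sequence[float], p: float) -> float:
    """Implied user: always views rank 1, then moves from each result
    to the next with probability p (the "persistence" parameter).

    The result at 0-based position i is therefore viewed with
    probability p**i, and (1 - p) normalizes the score into [0, 1].
    """
    return (1 - p) * sum(r * p**i for i, r in enumerate(rels))


# Hypothetical binary relevance judgments for a five-document ranking.
ranking = [1, 0, 1, 1, 0]

print(precision_at_k(ranking, 5))
print(rank_biased_precision(ranking, 0.5))
```

The two scores differ because the implied users differ: the first weights all five ranks equally, while the second places half of its total weight on the first result when p is 0.5.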


Source journal

Australasian Document Computing Symposium

Self-citation rate

0.00%

Publication count