Evaluating diversified search results using per-intent graded relevance

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval Pub Date : 2011-07-24 DOI:10.1145/2009916.2010055

T. Sakai, Ruihua Song

{"title":"Evaluating diversified search results using per-intent graded relevance","authors":"T. Sakai, Ruihua Song","doi":"10.1145/2009916.2010055","DOIUrl":null,"url":null,"abstract":"Search queries are often ambiguous and/or underspecified. To accomodate different user needs, search result diversification has received attention in the past few years. Accordingly, several new metrics for evaluating diversification have been proposed, but their properties are little understood. We compare the properties of existing metrics given the premises that (1) queries may have multiple intents; (2) the likelihood of each intent given a query is available; and (3) graded relevance assessments are available for each intent. We compare a wide range of traditional and diversified IR metrics after adding graded relevance assessments to the TREC 2009 Web track diversity task test collection which originally had binary relevance assessments. Our primary criterion is discriminative power, which represents the reliability of a metric in an experiment. Our results show that diversified IR experiments with a given number of topics can be as reliable as traditional IR experiments with the same number of topics, provided that the right metrics are used. Moreover, we compare the intuitiveness of diversified IR metrics by closely examining the actual ranked lists from TREC. We show that a family of metrics called D#-measures have several advantages over other metrics such as α-nDCG and Intent-Aware metrics.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"163","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 163

Abstract

Search queries are often ambiguous and/or underspecified. To accomodate different user needs, search result diversification has received attention in the past few years. Accordingly, several new metrics for evaluating diversification have been proposed, but their properties are little understood. We compare the properties of existing metrics given the premises that (1) queries may have multiple intents; (2) the likelihood of each intent given a query is available; and (3) graded relevance assessments are available for each intent. We compare a wide range of traditional and diversified IR metrics after adding graded relevance assessments to the TREC 2009 Web track diversity task test collection which originally had binary relevance assessments. Our primary criterion is discriminative power, which represents the reliability of a metric in an experiment. Our results show that diversified IR experiments with a given number of topics can be as reliable as traditional IR experiments with the same number of topics, provided that the right metrics are used. Moreover, we compare the intuitiveness of diversified IR metrics by closely examining the actual ranked lists from TREC. We show that a family of metrics called D#-measures have several advantages over other metrics such as α-nDCG and Intent-Aware metrics.

查看原文本刊更多论文

使用每个意图分级相关性评估多样化的搜索结果

搜索查询通常是含糊不清和/或指定不足的。为了适应不同的用户需求，搜索结果多样化在过去几年受到了关注。因此，已经提出了几个评估多样化的新指标，但人们对它们的性质知之甚少。我们比较了现有指标的属性，前提是:(1)查询可能有多个意图;(2)给定查询的每个意图的可能性是可用的;(3)对每个意图进行分级相关性评估。在TREC 2009网络轨道多样性任务测试集中加入分级相关性评估后，我们比较了广泛的传统和多样化IR指标，该测试集最初具有二元相关性评估。我们的主要标准是判别能力，它代表了实验中度量的可靠性。我们的研究结果表明，只要使用正确的指标，具有给定主题数量的多样化IR实验可以与具有相同主题数量的传统IR实验一样可靠。此外，我们通过仔细检查TREC的实际排名列表来比较多样化IR指标的直观性。我们表明，与α-nDCG和意图感知指标等其他指标相比，称为d#指标的一系列指标具有若干优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

自引率

0.00%

发文量