哪些多样性评估措施是“好的”?

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2019-07-18 DOI:10.1145/3331184.3331215

T. Sakai, Zhaohao Zeng

{"title":"哪些多样性评估措施是“好的”?","authors":"T. Sakai, Zhaohao Zeng","doi":"10.1145/3331184.3331215","DOIUrl":null,"url":null,"abstract":"This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question \"Which SERP is more relevant?'' and the other based on a diversity question \"Which SERP is likely to satisfy a higher number of users?'' To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the ♯-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best ♯-measures when compared to the gold diversity preferences.","PeriodicalId":20700,"journal":{"name":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"62 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":"{\"title\":\"Which Diversity Evaluation Measures Are \\\"Good\\\"?\",\"authors\":\"T. Sakai, Zhaohao Zeng\",\"doi\":\"10.1145/3331184.3331215\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question \\\"Which SERP is more relevant?'' and the other based on a diversity question \\\"Which SERP is likely to satisfy a higher number of users?'' To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the ♯-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best ♯-measures when compared to the gold diversity preferences.\",\"PeriodicalId\":20700,\"journal\":{\"name\":\"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"volume\":\"62 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"36\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3331184.3331215\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3331184.3331215","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 36

摘要

本研究评估了30个IR评估措施或其实例，其中9个用于特殊IR, 21个用于多样化IR，主要是从他们对一个SERP(搜索引擎结果页面)的偏好是否与另一个用户的偏好相一致的角度出发。黄金偏好由15名评估人员构建，他们独立检查了1,127对SERP并进行了偏好评估。获得了两组偏好评估:一组基于相关性问题“哪个SERP更相关?”，另一个则是基于一个多样性问题:“哪个SERP可能满足更多的用户?”“据我们所知，我们的研究是第一个以这种方式收集多样性偏好评估并成功评估多样性措施的研究。我们的主要结果是:(a)流行的临时IR指标，如nDCG，实际上与黄金相关性偏好相当一致;并且(b)虽然# -措施与黄金多样性偏好很好地一致，但意图意识措施表现相对较差。此外，作为我们对现有评价措施分析的副产品，我们定义了新的特别措施，称为iRBU(故意秩偏效用)和EBR(预期混合比率);我们证明，与黄金相关偏好相比，iRBU实例的表现与nDCG一样好。另一方面，与黄金多样性偏好相比，最初的RBU，最近提出的多样性指标，表现不如最佳# -指标。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Which Diversity Evaluation Measures Are "Good"?

This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question "Which SERP is more relevant?'' and the other based on a diversity question "Which SERP is likely to satisfy a higher number of users?'' To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the ♯-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best ♯-measures when compared to the gold diversity preferences.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

自引率

0.00%

发文量