基于偏好的网络图像搜索评价指标

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval Pub Date : 2020-07-25 DOI:10.1145/3397271.3401146

Xiaohui Xie, Jiaxin Mao, Y. Liu, M. de Rijke, Haitian Chen, Min Zhang, Shaoping Ma

{"title":"基于偏好的网络图像搜索评价指标","authors":"Xiaohui Xie, Jiaxin Mao, Y. Liu, M. de Rijke, Haitian Chen, Min Zhang, Shaoping Ma","doi":"10.1145/3397271.3401146","DOIUrl":null,"url":null,"abstract":"Following the success of Cranfield-like evaluation approaches to evaluation in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios due to the multi-dimensional nature of relevance for image results. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with search users' satisfaction perceptions in web image search. Unlike absolute relevance judgments, preference judgments do not require that relevance grades be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there exists no thorough investigation on how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than for general web search. In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how absolute relevance levels of two given images affect preference judgments between them. Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to evaluate and compare between two different image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement to system-level preferences of actual users.","PeriodicalId":252050,"journal":{"name":"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Preference-based Evaluation Metrics for Web Image Search\",\"authors\":\"Xiaohui Xie, Jiaxin Mao, Y. Liu, M. de Rijke, Haitian Chen, Min Zhang, Shaoping Ma\",\"doi\":\"10.1145/3397271.3401146\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Following the success of Cranfield-like evaluation approaches to evaluation in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios due to the multi-dimensional nature of relevance for image results. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with search users' satisfaction perceptions in web image search. Unlike absolute relevance judgments, preference judgments do not require that relevance grades be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there exists no thorough investigation on how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than for general web search. In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how absolute relevance levels of two given images affect preference judgments between them. Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to evaluate and compare between two different image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement to system-level preferences of actual users.\",\"PeriodicalId\":252050,\"journal\":{\"name\":\"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3397271.3401146\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3397271.3401146","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

继类似克兰菲尔德的评价方法在网络搜索中的成功之后，网络图像搜索也被评价为绝对(分级)相关性判断。然而，最近的研究发现，由于图像结果的相关性具有多维性，因此在图像搜索场景中收集绝对相关性判断可能很困难。此外，现有的基于绝对相关性判断的评价指标与搜索用户在网络图像搜索中的满意度感知相关性不强。与绝对相关性判断不同，偏好判断不需要预先定义相关性等级，即使用多少级别以及这些级别意味着什么。偏好判断不是孤立地考虑每个文件，而是考虑一对文件，并要求法官陈述他们的相对偏好。这种偏好判断通常比绝对判断更可靠，因为(至少)两个项目的存在建立了特定的上下文。虽然偏好判断已经被广泛地用于一般的网络搜索，但对于如何使用偏好判断和基于偏好的评估指标来评估网络图像搜索系统，还没有深入的研究。与一般的网络搜索相比，网络图像搜索可能更适合基于偏好的评估，因为它基于网格的呈现风格。网络图像搜索对新结果的需求有限，这也使得偏好判断比一般的网络搜索更具可重用性。在本文中，我们提供了一个全面的比较的变体偏好判断的网络图像搜索。我们发现，与严格偏好判断相比，弱偏好判断所需的时间更少，并且评估者之间的一致性更好。我们还研究了两个给定图像的绝对相关水平如何影响它们之间的偏好判断。此外，我们提出了一个基于偏好的评价指标，称为偏好-获胜-惩罚(PWP)，以评估和比较两种不同的图像搜索系统。建议的PWP度量优于现有的基于对实际用户的系统级偏好的绝对相关性判断的评估度量。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Preference-based Evaluation Metrics for Web Image Search

Following the success of Cranfield-like evaluation approaches to evaluation in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios due to the multi-dimensional nature of relevance for image results. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with search users' satisfaction perceptions in web image search. Unlike absolute relevance judgments, preference judgments do not require that relevance grades be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there exists no thorough investigation on how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than for general web search. In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how absolute relevance levels of two given images affect preference judgments between them. Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to evaluate and compare between two different image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement to system-level preferences of actual users.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

自引率

0.00%

发文量