Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval

Albert Gordo, Diane Larlus
{"title":"Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval","authors":"Albert Gordo, Diane Larlus","doi":"10.1109/CVPR.2017.560","DOIUrl":null,"url":null,"abstract":"Querying with an example image is a simple and intuitive interface to retrieve information from a visual database. Most of the research in image retrieval has focused on the task of instance-level image retrieval, where the goal is to retrieve images that contain the same object instance as the query image. In this work we move beyond instance-level retrieval and consider the task of semantic image retrieval in complex scenes, where the goal is to retrieve images that share the same semantics as the query image. We show that, despite its subjective nature, the task of semantically ranking visual scenes is consistently implemented across a pool of human annotators. We also show that a similarity based on human-annotated region-level captions is highly correlated with the human ranking and constitutes a good computable surrogate. Following this observation, we learn a visual embedding of the images where the similarity in the visual space is correlated with their semantic similarity surrogate. We further extend our model to learn a joint embedding of visual and textual cues that allows one to query the database using a text modifier in addition to the query image, adapting the results to the modifier. Finally, our model can ground the ranking decisions by showing regions that contributed the most to the similarity between pairs of images, providing a visual explanation of the similarity.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"20 1","pages":"5272-5281"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"81","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2017.560","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 81

Abstract

Querying with an example image is a simple and intuitive interface to retrieve information from a visual database. Most of the research in image retrieval has focused on the task of instance-level image retrieval, where the goal is to retrieve images that contain the same object instance as the query image. In this work we move beyond instance-level retrieval and consider the task of semantic image retrieval in complex scenes, where the goal is to retrieve images that share the same semantics as the query image. We show that, despite its subjective nature, the task of semantically ranking visual scenes is consistently implemented across a pool of human annotators. We also show that a similarity based on human-annotated region-level captions is highly correlated with the human ranking and constitutes a good computable surrogate. Following this observation, we learn a visual embedding of the images where the similarity in the visual space is correlated with their semantic similarity surrogate. We further extend our model to learn a joint embedding of visual and textual cues that allows one to query the database using a text modifier in addition to the query image, adapting the results to the modifier. Finally, our model can ground the ranking decisions by showing regions that contributed the most to the similarity between pairs of images, providing a visual explanation of the similarity.
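To make the caption-based surrogate concrete, below is a minimal sketch of one plausible implementation: the region-level captions of each image are pooled into a single document, represented as a tf-idf bag-of-words vector, and compared with cosine similarity. The tf-idf choice, the helper names, and the toy captions are illustrative assumptions, not details quoted from the abstract.

```python
# A minimal sketch of a caption-based semantic similarity surrogate, assuming
# (as a common practice, not a detail stated in the abstract) tf-idf
# bag-of-words vectors over pooled region-level captions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data: each image comes with several region-level captions,
# concatenated into one text document per image.
captions_per_image = {
    "img_a": ["a man riding a horse", "a dirt road", "trees in the background"],
    "img_b": ["a woman on a horse", "a grassy field"],
    "img_c": ["a laptop on a desk", "a cup of coffee"],
}

docs = [" ".join(caps) for caps in captions_per_image.values()]
tfidf = TfidfVectorizer().fit_transform(docs)  # one tf-idf vector per image
surrogate = cosine_similarity(tfidf)           # pairwise semantic surrogate

# surrogate[i, j] should be high for semantically similar scenes
# (img_a vs. img_b) and low for unrelated ones (img_a vs. img_c).
print(surrogate.round(2))
```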
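Given such a surrogate, the visual embedding described in the abstract can be learned so that similarities in the visual space respect the surrogate's ordering. One standard way to do this, sketched below in PyTorch with assumed names and hyperparameters (the abstract does not specify the loss), is a triplet ranking loss: for an anchor image, a positive with higher surrogate similarity, and a negative with lower surrogate similarity, the loss pushes the visual similarity of the positive pair above that of the negative pair by a margin.

```python
# A minimal sketch of triplet-based embedding learning; the margin, encoder,
# and sampling are illustrative assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge ranking loss on L2-normalized embeddings.

    `positive` has a higher caption-surrogate similarity to `anchor` than
    `negative`; the loss enforces the same ordering in the visual space.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    sim_ap = (a * p).sum(dim=-1)  # cosine similarity anchor-positive
    sim_an = (a * n).sum(dim=-1)  # cosine similarity anchor-negative
    return F.relu(margin + sim_an - sim_ap).mean()

# Hypothetical usage: `encoder` stands in for any image network producing a
# global descriptor; real triplets would be sampled from the surrogate matrix.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
imgs = torch.randn(3, 3, 32, 32)  # anchor, positive, negative (toy data)
emb = encoder(imgs)
loss = triplet_loss(emb[0:1], emb[1:2], emb[2:3])
loss.backward()
```

In practice the triplets would be mined from the surrogate similarities computed over the caption corpus; the random tensors above only demonstrate that the loss is differentiable end to end.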