基于视觉语言注意学习的文本反馈图像搜索

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Pub Date : 2020-06-01 DOI:10.1109/CVPR42600.2020.00307

Yanbei Chen, S. Gong, Loris Bazzani

{"title":"基于视觉语言注意学习的文本反馈图像搜索","authors":"Yanbei Chen, S. Gong, Loris Bazzani","doi":"10.1109/CVPR42600.2020.00307","DOIUrl":null,"url":null,"abstract":"Image search with text feedback has promising impacts in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires the synergistic understanding of both image and text. In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged in a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentive to encapsulate the multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various text feedbacks, including attribute-like and natural language descriptions.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"42 1","pages":"2998-3008"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"109","resultStr":"{\"title\":\"Image Search With Text Feedback by Visiolinguistic Attention Learning\",\"authors\":\"Yanbei Chen, S. Gong, Loris Bazzani\",\"doi\":\"10.1109/CVPR42600.2020.00307\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Image search with text feedback has promising impacts in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires the synergistic understanding of both image and text. In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged in a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentive to encapsulate the multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various text feedbacks, including attribute-like and natural language descriptions.\",\"PeriodicalId\":6715,\"journal\":{\"name\":\"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"volume\":\"42 1\",\"pages\":\"2998-3008\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"109\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPR42600.2020.00307\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR42600.2020.00307","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 109

摘要

具有文本反馈的图像搜索在电子商务和互联网搜索等各种现实应用中具有广阔的应用前景。给定用户的参考图像和文本反馈，目标是检索图像不仅与输入图像相似，而且根据给定文本改变某些方面。这是一项具有挑战性的任务，因为它需要对图像和文本的协同理解。在这项工作中，我们通过一个新的视觉语言注意学习(VAL)框架来解决这个问题。具体来说，我们提出了一种可以无缝插入CNN的复合转换器，以选择性地保留和转换语言语义条件下的视觉特征。通过在不同深度插入多个复合变形器，VAL被激励封装多颗粒视觉语言信息，从而为有效的图像搜索提供富有表现力的表示。我们对三个数据集:Fashion200k、Shoes和FashionIQ进行了综合评估。大量的实验表明，我们的模型在所有数据集上都超过了现有的方法，在处理各种文本反馈(包括类属性和自然语言描述)方面表现出一致的优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Image Search With Text Feedback by Visiolinguistic Attention Learning

Image search with text feedback has promising impacts in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires the synergistic understanding of both image and text. In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged in a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentive to encapsulate the multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various text feedbacks, including attribute-like and natural language descriptions.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

自引率

0.00%

发文量