FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

Sonam Goenka, Zhao-Heng Zheng, Ayush Jaiswal, Rakesh Chada, Yuehua Wu, Varsha Hedau, P. Natarajan
Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022-06-01. DOI: 10.1109/CVPR52688.2022.01371 (https://doi.org/10.1109/CVPR52688.2022.01371)
Citations: 40

Abstract

Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
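
The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea it names: several feature vectors for a candidate (target) image are fused with attention weights, and candidates are then ranked against a query embedding. Everything here is an assumption made for illustration: the function names (`attention_pool`, `rank_candidates`), the single attention vector `attn_vector`, the use of cosine similarity, and the toy two-dimensional features. In FashionVLP the query embedding would come from the transformer over the reference image and feedback text; here it is just a placeholder vector.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, attn_vector):
    """Fuse several feature vectors into one via a softmax-weighted sum.

    `attn_vector` is a hypothetical stand-in for learned attention
    parameters; it scores each feature vector by dot product.
    """
    weights = softmax([dot(f, attn_vector) for f in features])
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(dot(a, a))
    nb = math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def rank_candidates(query_embedding, candidates, attn_vector):
    """Return candidate ids sorted by similarity to the query, best first.

    `candidates` maps an id to a list of feature vectors (for example,
    features taken at different levels of context).
    """
    scored = []
    for cid, feats in candidates.items():
        pooled = attention_pool(feats, attn_vector)
        scored.append((cosine(query_embedding, pooled), cid))
    scored.sort(reverse=True)
    return [cid for _, cid in scored]


# Toy usage: the query embedding is a placeholder, not a real model output.
query = [1.0, 0.0]
candidates = {
    "a": [[1.0, 0.0], [0.9, 0.1]],
    "b": [[0.0, 1.0], [0.1, 0.9]],
}
ranking = rank_candidates(query, candidates, attn_vector=[1.0, 1.0])
print(ranking)
```

The design point this illustrates is the asymmetry mentioned in the abstract: the target side is fused by a lightweight attention step only, with no text input and no transformer layers, so candidate embeddings could in principle be computed independently of any particular query.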