Image Retrieval with Text Feedback based on Transformer Deep Model

Truc Luong-Phuong Huynh, N. Ly
{"title":"基于Transformer深度模型的文本反馈图像检索","authors":"Truc Luong-Phuong Huynh, N. Ly","doi":"10.1109/NICS54270.2021.9701539","DOIUrl":null,"url":null,"abstract":"Image retrieval with text feedback has many potentials when applied in product retrieval for e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also have some modified details mentioned in the text feedback. This is a tricky task as it requires a good understanding of image, text, and also their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) and a Transformer-based combining function that performs preservation and transformation features of the input image based on the text feedback and captures important features of database images. By using multiple image features at different Convolution Neural Network (CNN) depths, the combining function can have multi-level visual information to achieve an impressive representation that satisfies for effective image retrieval. We conduct quantitative and qualitative experiments on two datasets: CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can deal with many types of text feedback such as object attributes and natural language. We are also the first ones to discover the exceptional behavior of the attention mechanism in this task which ignores input image regions where text feedback wants to remove or change.","PeriodicalId":296963,"journal":{"name":"2021 8th NAFOSTED Conference on Information and Computer Science (NICS)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Image Retrieval with Text Feedback based on Transformer Deep Model\",\"authors\":\"Truc Luong-Phuong Huynh, N. Ly\",\"doi\":\"10.1109/NICS54270.2021.9701539\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Image retrieval with text feedback has many potentials when applied in product retrieval for e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also have some modified details mentioned in the text feedback. This is a tricky task as it requires a good understanding of image, text, and also their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) and a Transformer-based combining function that performs preservation and transformation features of the input image based on the text feedback and captures important features of database images. By using multiple image features at different Convolution Neural Network (CNN) depths, the combining function can have multi-level visual information to achieve an impressive representation that satisfies for effective image retrieval. We conduct quantitative and qualitative experiments on two datasets: CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can deal with many types of text feedback such as object attributes and natural language. 
We are also the first ones to discover the exceptional behavior of the attention mechanism in this task which ignores input image regions where text feedback wants to remove or change.\",\"PeriodicalId\":296963,\"journal\":{\"name\":\"2021 8th NAFOSTED Conference on Information and Computer Science (NICS)\",\"volume\":\"63 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 8th NAFOSTED Conference on Information and Computer Science (NICS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/NICS54270.2021.9701539\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 8th NAFOSTED Conference on Information and Computer Science (NICS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NICS54270.2021.9701539","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Image retrieval with text feedback has great potential when applied to product retrieval on e-commerce platforms. Given an input image and text feedback, the system must retrieve images that not only look visually similar to the input image but also incorporate the modifications described in the text feedback. This is a challenging task, as it requires a good understanding of the image, the text, and their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) together with a Transformer-based combining function that preserves and transforms features of the input image according to the text feedback and captures important features of database images. By using image features from multiple Convolutional Neural Network (CNN) depths, the combining function draws on multi-level visual information to produce a representation well suited to effective image retrieval. We conduct quantitative and qualitative experiments on two datasets, CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can handle many types of text feedback, such as object attributes and natural language. We are also the first to report a notable behavior of the attention mechanism in this task: it ignores input image regions that the text feedback asks to remove or change.
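The abstract describes the combining function only at a high level: multi-depth CNN feature maps and a text-feedback embedding are fused by a Transformer into a single embedding used for retrieval. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the authors' published ITMA architecture; the module name TransformerCombiner, the feature dimensions (ResNet-like 512/1024/2048 maps, a 768-d text embedding), the learnable CLS-style pooling token, and the layer/head counts are all illustrative assumptions.

```python
# A minimal sketch, assuming multi-depth CNN feature maps and a sentence-level
# text embedding are flattened into one token sequence, fused by a Transformer
# encoder, and pooled into an L2-normalised retrieval embedding.
# All dimensions and the CLS-token pooling are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerCombiner(nn.Module):
    def __init__(self, image_dims=(512, 1024, 2048), text_dim=768,
                 embed_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        # Project the CNN features from each depth to a shared embedding size.
        self.image_projs = nn.ModuleList(
            [nn.Conv2d(c, embed_dim, kernel_size=1) for c in image_dims]
        )
        # Project the text-feedback embedding to the same size.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable CLS-style token whose output serves as the joint embedding.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, image_feats, text_feat):
        # image_feats: list of (B, C_i, H_i, W_i) maps from different CNN depths.
        # text_feat:   (B, text_dim) embedding of the text feedback.
        tokens = []
        for proj, feat in zip(self.image_projs, image_feats):
            x = proj(feat)                               # (B, D, H, W)
            tokens.append(x.flatten(2).transpose(1, 2))  # (B, H*W, D)
        tokens.append(self.text_proj(text_feat).unsqueeze(1))  # (B, 1, D)
        cls = self.cls_token.expand(tokens[0].size(0), -1, -1)
        seq = torch.cat([cls] + tokens, dim=1)
        fused = self.encoder(seq)
        # L2-normalised CLS output acts as the query embedding for retrieval.
        return F.normalize(fused[:, 0], dim=-1)


if __name__ == "__main__":
    combiner = TransformerCombiner()
    feats = [torch.randn(2, 512, 28, 28),
             torch.randn(2, 1024, 14, 14),
             torch.randn(2, 2048, 7, 7)]
    text = torch.randn(2, 768)
    print(combiner(feats, text).shape)  # torch.Size([2, 512])
```

In a retrieval setting, database images would be embedded in the same space (e.g., by pooling their own CNN features) and ranked by cosine similarity against the output above; how ITMA embeds the database side and which losses are used is not specified in the abstract.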