Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Nicola Messina, G. Amato, F. Falchi, C. Gennaro, S. Marchand-Maillet
{"title":"基于变换编码器深度特征的高效跨模态视觉文本检索","authors":"Nicola Messina, G. Amato, F. Falchi, C. Gennaro, S. Marchand-Maillet","doi":"10.1109/CBMI50038.2021.9461890","DOIUrl":null,"url":null,"abstract":"Cross-modal retrieval is an important functionality in modern search engines, as it increases the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence features extractor. It is designed for producing fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words respectively). All these vectors are enforced by the TERN design to lie into the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.","PeriodicalId":289262,"journal":{"name":"2021 International Conference on Content-Based Multimedia Indexing (CBMI)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features\",\"authors\":\"Nicola Messina, G. Amato, F. Falchi, C. Gennaro, S. Marchand-Maillet\",\"doi\":\"10.1109/CBMI50038.2021.9461890\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-modal retrieval is an important functionality in modern search engines, as it increases the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features extracting them from state-of-the-art deep-learning architectures for image-text matching. 
Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence features extractor. It is designed for producing fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words respectively). All these vectors are enforced by the TERN design to lie into the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.\",\"PeriodicalId\":289262,\"journal\":{\"name\":\"2021 International Conference on Content-Based Multimedia Indexing (CBMI)\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Content-Based Multimedia Indexing (CBMI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CBMI50038.2021.9461890\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Content-Based Multimedia Indexing (CBMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CBMI50038.2021.9461890","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image retrieval) or the relevant sentences for a given image (sentence retrieval). The computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms, and evaluates matching performance on the retrieval task by performing sequential scans of the whole dataset. This approach does not scale well with an increasing number of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay the groundwork for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence feature extractor. It is designed to produce fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the building components of the two modalities (image regions and sentence words, respectively). The TERN design enforces that all these vectors lie in the same common space. Our experiments show interesting preliminary results for the explored methods and suggest further experimentation in this important research direction.
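
The abstract describes two ingredients that are easy to illustrate in isolation: both modalities are embedded in one common 1024-d space, so retrieval reduces to a nearest-neighbour scan, and the paper studies sparsifying those features to make them indexable. The sketch below is a minimal, self-contained illustration using random placeholder vectors; the helper names (`sparsify`, `TOP_COMPONENTS`) and the top-k-magnitude sparsification rule are assumptions made for this example, not the authors' preprocessing techniques or the actual TERN pipeline.

```python
# Illustrative sketch of retrieval over a shared embedding space with
# naively sparsified features. Random vectors stand in for the fixed-size
# 1024-d image/sentence features an extractor such as TERN would produce.
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024            # dimensionality of the common space (as in the paper)
TOP_COMPONENTS = 64   # assumed sparsification budget: keep k largest-magnitude entries

# Placeholder database of image embeddings and one sentence (query) embedding.
image_feats = rng.standard_normal((1_000, DIM)).astype(np.float32)
query_feat = rng.standard_normal(DIM).astype(np.float32)

def sparsify(x: np.ndarray, k: int) -> np.ndarray:
    """Zero out all but the k largest-magnitude components along the last axis."""
    idx = np.argsort(-np.abs(x), axis=-1)[..., :k]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalise so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sparse_images = l2_normalize(sparsify(image_feats, TOP_COMPONENTS))
sparse_query = l2_normalize(sparsify(query_feat, TOP_COMPONENTS))

# Sequential scan: score every image against the query and keep the top 5.
# This is the brute-force baseline whose poor scaling motivates indexing.
scores = sparse_images @ sparse_query
top5 = np.argsort(-scores)[:5]
print(top5, scores[top5])
```

Sparse vectors like these are amenable to inverted-file-style indexing, which is the kind of efficiency gain the paper is working towards; the scan above is only the baseline it seeks to avoid.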