Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Nicola Messina, G. Amato, F. Falchi, C. Gennaro, S. Marchand-Maillet
{"title":"基于变换编码器深度特征的高效跨模态视觉文本检索","authors":"Nicola Messina, G. Amato, F. Falchi, C. Gennaro, S. Marchand-Maillet","doi":"10.1109/CBMI50038.2021.9461890","DOIUrl":null,"url":null,"abstract":"Cross-modal retrieval is an important functionality in modern search engines, as it increases the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence features extractor. It is designed for producing fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words respectively). All these vectors are enforced by the TERN design to lie into the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.","PeriodicalId":289262,"journal":{"name":"2021 International Conference on Content-Based Multimedia Indexing (CBMI)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features\",\"authors\":\"Nicola Messina, G. Amato, F. Falchi, C. Gennaro, S. Marchand-Maillet\",\"doi\":\"10.1109/CBMI50038.2021.9461890\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-modal retrieval is an important functionality in modern search engines, as it increases the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features extracting them from state-of-the-art deep-learning architectures for image-text matching. 
Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence features extractor. It is designed for producing fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words respectively). All these vectors are enforced by the TERN design to lie into the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.\",\"PeriodicalId\":289262,\"journal\":{\"name\":\"2021 International Conference on Content-Based Multimedia Indexing (CBMI)\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Content-Based Multimedia Indexing (CBMI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CBMI50038.2021.9461890\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Content-Based Multimedia Indexing (CBMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CBMI50038.2021.9461890","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image retrieval) or the relevant sentences for a given image (sentence retrieval). The computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms, and evaluates matching performance on the retrieval task by performing sequential scans of the whole dataset. This approach does not scale well with an increasing number of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay the groundwork for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence feature extractor. It is designed to produce fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the building components of the two modalities (image regions and sentence words, respectively). The TERN design enforces that all these vectors lie in the same common space. Our experiments show interesting preliminary results for the explored methods and suggest further experimentation in this important research direction.
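
The abstract describes two ingredients that are easy to illustrate in isolation: both modalities are embedded in one common 1024-d space, so retrieval reduces to a nearest-neighbour scan, and the paper studies sparsifying those features to make them indexable. The sketch below is a minimal, self-contained illustration using random placeholder vectors; the helper names (`sparsify`, `TOP_COMPONENTS`) and the top-k-magnitude sparsification rule are assumptions made for this example, not the authors' preprocessing techniques or the actual TERN pipeline.

```python
# Illustrative sketch of retrieval over a shared embedding space with
# naively sparsified features. Random vectors stand in for the fixed-size
# 1024-d image/sentence features an extractor such as TERN would produce.
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024            # dimensionality of the common space (as in the paper)
TOP_COMPONENTS = 64   # assumed sparsification budget: keep k largest-magnitude entries

# Placeholder database of image embeddings and one sentence (query) embedding.
image_feats = rng.standard_normal((1_000, DIM)).astype(np.float32)
query_feat = rng.standard_normal(DIM).astype(np.float32)

def sparsify(x: np.ndarray, k: int) -> np.ndarray:
    """Zero out all but the k largest-magnitude components along the last axis."""
    idx = np.argsort(-np.abs(x), axis=-1)[..., :k]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalise so that a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sparse_images = l2_normalize(sparsify(image_feats, TOP_COMPONENTS))
sparse_query = l2_normalize(sparsify(query_feat, TOP_COMPONENTS))

# Sequential scan: score every image against the query and keep the top 5.
# This is the brute-force baseline whose poor scaling motivates indexing.
scores = sparse_images @ sparse_query
top5 = np.argsort(-scores)[:5]
print(top5, scores[top5])
```

Sparse vectors like these are amenable to inverted-file-style indexing, which is the kind of efficiency gain the paper is working towards; the scan above is only the baseline it seeks to avoid.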