Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images

Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. Pub Date: 2018-02-02. DOI: 10.1145/3159652.3159735

Saeid Balaneshin Kordan, Alexander Kotov


Citations: 18

Abstract

Recent advances in deep learning and distributed representations of images and text have led to several neural architectures for cross-modal retrieval tasks, such as searching collections of images in response to textual queries and assigning textual descriptions to images. However, the multi-modal retrieval scenario, in which a query can be either text or an image and the goal is to retrieve a textual fragment together with an image as a single atomic unit, has been studied significantly less. In this paper, we propose a gated neural architecture that projects image and keyword queries, as well as multi-modal retrieval units, into the same low-dimensional embedding space and performs semantic matching in that space. The proposed architecture is trained to minimize a structured hinge loss and can be applied to both cross-modal and multi-modal retrieval. Experimental results for six different cross- and multi-modal retrieval tasks on publicly available datasets indicate that the proposed architecture achieves higher retrieval accuracy than state-of-the-art baselines.
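The idea of matching queries and multi-modal units in one embedding space can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (gated_fusion, hinge_ranking_loss), the fixed gate logits, the margin value, and the use of a plain max-margin ranking loss in place of the paper's structured hinge loss are all assumptions made for illustration. In a trained model the gate and the projections would be learned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale to unit length so dot products act as cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, image_vec, gate_logits):
    # Hypothetical element-wise gate: g * text + (1 - g) * image.
    # gate_logits are passed in here; a real model would compute them.
    fused = []
    for t, i, z in zip(text_vec, image_vec, gate_logits):
        g = sigmoid(z)
        fused.append(g * t + (1.0 - g) * i)
    return normalize(fused)

def hinge_ranking_loss(query, positive, negatives, margin=0.2):
    # Max-margin ranking loss (a stand-in for the structured hinge loss):
    # penalize each negative scoring within `margin` of the positive.
    pos_score = dot(query, positive)
    return sum(max(0.0, margin - pos_score + dot(query, neg))
               for neg in negatives)

# Toy example: a retrieval unit made of a text vector and an image vector.
unit = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_logits=[0.0, 0.0])
query = normalize([1.0, 1.0])
loss = hinge_ranking_loss(query, unit, negatives=[normalize([1.0, -1.0])])
```

With zero gate logits both modalities contribute equally, so the fused unit points in the same direction as the query, and the orthogonal negative incurs no loss.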
