{"title":"Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval","authors":"Zhixian Zeng, Jianjun Cao, Guoquan Jiang, Nianfeng Weng, Yuxin Xu, Zibo Nie","doi":"10.1145/3487075.3487167","DOIUrl":null,"url":null,"abstract":"Visual semantic embedding network or cross-modal cross-attention network are usually adopted for image-text retrieval. Existing works have confirmed that both visual semantic embedding network and cross-modal cross-attention network can achieve similar performance, but the former has lower computational complexity so that its retrieval speed is faster and its engineering application value is higher than the latter. In this paper, we propose a Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval, which contains two independent branch substructures including the image embedding network and the text embedding network. In the design of the image embedding network, firstly, a feature extraction network is employed to extract the fine-grained features of the image. Then, we design a graph attention mechanism module with residual link for image semantic enhancement. Finally, the Softmax pooling strategy is used to map the image fine-grained features to a common embedding space. In the design of the text embedding network, we use the pre-trained BERT-base-uncased to extract context-related word vectors, which will be fine-tuned in training. Finally, the fine-grained word vectors are mapped to a common embedding space by a maximum pooling. In the common embedding space, a soft label-based triplet loss function is adopted for cross-modal semantic alignment learning. Through experimental verification on two widely used datasets, namely MS-COCO and Flickr-30K, our proposed SVSEN achieves the best performance. For instance, on Flickr-30K, our SVSEN outperforms image retrieval by 3.91% relatively and text retrieval by 1.96% relatively (R@1).","PeriodicalId":354966,"journal":{"name":"Proceedings of the 5th International Conference on Computer Science and Application Engineering","volume":"50 17","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th International Conference on Computer Science and Application Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3487075.3487167","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Visual semantic embedding networks and cross-modal cross-attention networks are the two architectures usually adopted for image-text retrieval. Existing work has confirmed that the two achieve similar accuracy, but the former has lower computational complexity, so it retrieves faster and offers greater engineering application value. In this paper, we propose a Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval, which consists of two independent branches: an image embedding network and a text embedding network. In the image embedding network, a feature extraction network first extracts fine-grained image features. We then design a graph attention module with a residual link for image semantic enhancement. Finally, a softmax pooling strategy maps the fine-grained image features into a common embedding space. In the text embedding network, we use pre-trained BERT-base-uncased, fine-tuned during training, to extract context-dependent word vectors; the fine-grained word vectors are then mapped into the common embedding space by max pooling. In the common embedding space, a soft-label triplet loss is adopted for cross-modal semantic alignment learning. Experiments on two widely used datasets, MS-COCO and Flickr-30K, show that SVSEN achieves the best performance; on Flickr-30K, for instance, it improves image retrieval by a relative 3.91% and text retrieval by a relative 1.96% at R@1.
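The abstract only outlines the image branch, so the following is a minimal PyTorch sketch of what a graph attention module with a residual link, followed by softmax pooling into the joint space, could look like. The single-head attention formulation, the scaling factor, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualGraphAttention(nn.Module):
    """Single-head graph attention over N region features with a residual link."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, D) region features
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)    # (B, N, N) region-to-region attention
        return x + attn @ self.v(x)             # residual link for semantic enhancement

def softmax_pool(x):                            # x: (B, N, D)
    """Softmax-pool: weight each region per feature dimension, then sum over regions."""
    w = torch.softmax(x, dim=1)                 # weights across the N regions
    return (w * x).sum(dim=1)                   # (B, D) image embedding
```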
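For the text branch, the abstract names BERT-base-uncased (fine-tuned in training) followed by max pooling over word vectors. A hedged sketch using the Hugging Face `transformers` library is below; the projection into the joint space and its dimension are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextBranch(nn.Module):
    def __init__(self, embed_dim=1024):         # embed_dim is an assumed value
        super().__init__()
        # Pre-trained weights; the module stays trainable so BERT is fine-tuned.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        words = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        words = self.proj(words)                # (B, T, embed_dim) fine-grained word vectors
        # Exclude padding tokens from the max before pooling.
        words = words.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        return words.max(dim=1).values          # max pooling over tokens -> (B, embed_dim)
```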
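The abstract does not give the exact soft-label formulation of the triplet loss. As context, the sketch below shows the standard hard-negative hinge triplet loss over a cosine-similarity matrix (VSE++-style) on which such a soft-label variant would build; the margin value and the hardest-negative mining strategy are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(img, txt, margin=0.2):        # img, txt: (B, D), matched by row index
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = img @ txt.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)              # (B, 1) positive-pair scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge costs against all negatives, with the diagonal (positives) zeroed out.
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image anchor, text negatives
    cost_i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text anchor, image negatives
    # Keep only the hardest negative in each direction.
    return cost_t.max(dim=1).values.mean() + cost_i.max(dim=0).values.mean()
```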