{"title":"Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval","authors":"Zhixian Zeng, Jianjun Cao, Guoquan Jiang, Nianfeng Weng, Yuxin Xu, Zibo Nie","doi":"10.1145/3487075.3487167","DOIUrl":null,"url":null,"abstract":"Visual semantic embedding network or cross-modal cross-attention network are usually adopted for image-text retrieval. Existing works have confirmed that both visual semantic embedding network and cross-modal cross-attention network can achieve similar performance, but the former has lower computational complexity so that its retrieval speed is faster and its engineering application value is higher than the latter. In this paper, we propose a Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval, which contains two independent branch substructures including the image embedding network and the text embedding network. In the design of the image embedding network, firstly, a feature extraction network is employed to extract the fine-grained features of the image. Then, we design a graph attention mechanism module with residual link for image semantic enhancement. Finally, the Softmax pooling strategy is used to map the image fine-grained features to a common embedding space. In the design of the text embedding network, we use the pre-trained BERT-base-uncased to extract context-related word vectors, which will be fine-tuned in training. Finally, the fine-grained word vectors are mapped to a common embedding space by a maximum pooling. In the common embedding space, a soft label-based triplet loss function is adopted for cross-modal semantic alignment learning. Through experimental verification on two widely used datasets, namely MS-COCO and Flickr-30K, our proposed SVSEN achieves the best performance. For instance, on Flickr-30K, our SVSEN outperforms image retrieval by 3.91% relatively and text retrieval by 1.96% relatively (R@1).","PeriodicalId":354966,"journal":{"name":"Proceedings of the 5th International Conference on Computer Science and Application Engineering","volume":"50 17","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th International Conference on Computer Science and Application Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3487075.3487167","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Visual semantic embedding networks and cross-modal cross-attention networks are the two architectures usually adopted for image-text retrieval. Existing work has confirmed that the two achieve similar accuracy, but the former has lower computational complexity, so it retrieves faster and offers greater engineering application value. In this paper, we propose a Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval, which consists of two independent branches: an image embedding network and a text embedding network. In the image embedding network, a feature extraction network first extracts fine-grained image features. We then design a graph attention module with a residual link for image semantic enhancement. Finally, a softmax pooling strategy maps the fine-grained image features into a common embedding space. In the text embedding network, we use pre-trained BERT-base-uncased, fine-tuned during training, to extract context-dependent word vectors; the fine-grained word vectors are then mapped into the common embedding space by max pooling. In the common embedding space, a soft-label triplet loss is adopted for cross-modal semantic alignment learning. Experiments on two widely used datasets, MS-COCO and Flickr-30K, show that SVSEN achieves the best performance; on Flickr-30K, for instance, it improves image retrieval by a relative 3.91% and text retrieval by a relative 1.96% at R@1.
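The abstract only outlines the image branch, so the following is a minimal PyTorch sketch of what a graph attention module with a residual link, followed by softmax pooling into the joint space, could look like. The single-head attention formulation, the scaling factor, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualGraphAttention(nn.Module):
    """Single-head graph attention over N region features with a residual link."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, D) region features
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)    # (B, N, N) region-to-region attention
        return x + attn @ self.v(x)             # residual link for semantic enhancement

def softmax_pool(x):                            # x: (B, N, D)
    """Softmax-pool: weight each region per feature dimension, then sum over regions."""
    w = torch.softmax(x, dim=1)                 # weights across the N regions
    return (w * x).sum(dim=1)                   # (B, D) image embedding
```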
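For the text branch, the abstract names BERT-base-uncased (fine-tuned in training) followed by max pooling over word vectors. A hedged sketch using the Hugging Face `transformers` library is below; the projection into the joint space and its dimension are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextBranch(nn.Module):
    def __init__(self, embed_dim=1024):         # embed_dim is an assumed value
        super().__init__()
        # Pre-trained weights; the module stays trainable so BERT is fine-tuned.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        words = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        words = self.proj(words)                # (B, T, embed_dim) fine-grained word vectors
        # Exclude padding tokens from the max before pooling.
        words = words.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        return words.max(dim=1).values          # max pooling over tokens -> (B, embed_dim)
```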
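The abstract does not give the exact soft-label formulation of the triplet loss. As context, the sketch below shows the standard hard-negative hinge triplet loss over a cosine-similarity matrix (VSE++-style) on which such a soft-label variant would build; the margin value and the hardest-negative mining strategy are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(img, txt, margin=0.2):        # img, txt: (B, D), matched by row index
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = img @ txt.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)              # (B, 1) positive-pair scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge costs against all negatives, with the diagonal (positives) zeroed out.
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image anchor, text negatives
    cost_i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text anchor, image negatives
    # Keep only the hardest negative in each direction.
    return cost_t.max(dim=1).values.mean() + cost_i.max(dim=0).values.mean()
```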