{"title":"跨模态检索的统一语义空间学习","authors":"Jie Zhu , Jianan Liu , Shufang Wu , Feng Zhang","doi":"10.1016/j.neunet.2025.107756","DOIUrl":null,"url":null,"abstract":"<div><div>With the increasing amount of multimodal data on the Internet, cross-modal retrieval has gradually become a hot research topic and has achieved significant progress, especially since graph convolutional networks were introduced. Most methods based on graph convolutional networks tend to focus on incorporating the correlations among samples and the correlations among labels into the common representations, but neglect the correlations among the semantic contents. Moreover, the semantic similarity between instances and semantic contents is also underutilized. To address these issues, we propose a Unified Semantic Space Learning (USSL) method, which not only explores the correlations of the semantic contents but also maps images, texts, labels, and multi-labels into a unified semantic space, facilitating the calculation of similarities between samples and between samples and semantic contents. To fully explore the correlations of the semantic contents, we construct a label-multi-label graph and learn the correlations of the semantic contents in a data-driven manner using our proposed Group Semantic Sharing Graph Convolutional Network. Furthermore, we propose an isomorphic InfoNCE loss to bridge the heterogeneity gap between the samples and semantic contents, along with an intra-modality InfoNCE loss and an inter-modality InfoNCE loss to maintain the semantic and structural consistencies of the learned modality-invariant common representations. Through comparative experiments on three representative cross-modal datasets, we have demonstrated the superiority of our proposed method.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"191 ","pages":"Article 107756"},"PeriodicalIF":6.0000,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Unified semantic space learning for cross-modal retrieval\",\"authors\":\"Jie Zhu , Jianan Liu , Shufang Wu , Feng Zhang\",\"doi\":\"10.1016/j.neunet.2025.107756\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>With the increasing amount of multimodal data on the Internet, cross-modal retrieval has gradually become a hot research topic and has achieved significant progress, especially since graph convolutional networks were introduced. Most methods based on graph convolutional networks tend to focus on incorporating the correlations among samples and the correlations among labels into the common representations, but neglect the correlations among the semantic contents. Moreover, the semantic similarity between instances and semantic contents is also underutilized. To address these issues, we propose a Unified Semantic Space Learning (USSL) method, which not only explores the correlations of the semantic contents but also maps images, texts, labels, and multi-labels into a unified semantic space, facilitating the calculation of similarities between samples and between samples and semantic contents. To fully explore the correlations of the semantic contents, we construct a label-multi-label graph and learn the correlations of the semantic contents in a data-driven manner using our proposed Group Semantic Sharing Graph Convolutional Network. 
Furthermore, we propose an isomorphic InfoNCE loss to bridge the heterogeneity gap between the samples and semantic contents, along with an intra-modality InfoNCE loss and an inter-modality InfoNCE loss to maintain the semantic and structural consistencies of the learned modality-invariant common representations. Through comparative experiments on three representative cross-modal datasets, we have demonstrated the superiority of our proposed method.</div></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"191 \",\"pages\":\"Article 107756\"},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2025-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608025006367\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025006367","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
With the growing volume of multimodal data on the Internet, cross-modal retrieval has become a hot research topic and has achieved significant progress, especially since the introduction of graph convolutional networks. Most methods based on graph convolutional networks focus on incorporating the correlations among samples and the correlations among labels into the common representations, but neglect the correlations among semantic contents. Moreover, the semantic similarity between instances and semantic contents is underutilized. To address these issues, we propose a Unified Semantic Space Learning (USSL) method, which not only explores the correlations among semantic contents but also maps images, texts, labels, and multi-labels into a unified semantic space, facilitating the calculation of similarities both between samples and between samples and semantic contents. To fully exploit these correlations, we construct a label-multi-label graph and learn the correlations among semantic contents in a data-driven manner using our proposed Group Semantic Sharing Graph Convolutional Network. Furthermore, we propose an isomorphic InfoNCE loss to bridge the heterogeneity gap between samples and semantic contents, along with an intra-modality InfoNCE loss and an inter-modality InfoNCE loss to maintain the semantic and structural consistency of the learned modality-invariant common representations. Comparative experiments on three representative cross-modal datasets demonstrate the superiority of the proposed method.
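The abstract does not detail the Group Semantic Sharing Graph Convolutional Network itself. As a point of reference only, the sketch below shows a single propagation step of a standard graph convolutional layer (in the common Kipf-and-Welling form) over a label correlation graph; the adjacency construction, the dimensions, and the gcn_layer helper are illustrative assumptions, not the authors' formulation.

import torch

def gcn_layer(adj: torch.Tensor, feats: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # One propagation step with self-loops: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    a_hat = adj + torch.eye(adj.size(0))                # add self-loops
    deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)           # D^{-1/2} as a vector
    norm_adj = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
    return torch.relu(norm_adj @ feats @ weight)

# Hypothetical usage: propagate label embeddings over a label-multi-label graph.
num_nodes, in_dim, out_dim = 10, 64, 128
adj = (torch.rand(num_nodes, num_nodes) > 0.7).float()  # random illustrative adjacency
adj = torch.maximum(adj, adj.t())                       # symmetrize the graph
labels = torch.randn(num_nodes, in_dim)                 # initial label embeddings
w = torch.randn(in_dim, out_dim)                        # learnable layer weight
out = gcn_layer(adj, labels, w)                         # shape: (10, 128)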
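Likewise, the exact isomorphic, intra-modality, and inter-modality InfoNCE losses are not specified in the abstract. The following is a minimal sketch of the standard InfoNCE objective for aligning two sets of embeddings in a shared space, assuming in-batch negatives and a temperature hyperparameter tau; the paper's three variants would presumably apply such a term to different pairings (sample to semantic content, within a modality, and across modalities).

import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # (anchors[i], positives[i]) are matched pairs; all other rows act as
    # in-batch negatives. Embeddings are L2-normalized so dot products are cosines.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / tau                            # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Hypothetical usage: an inter-modality alignment term between image and text
# embeddings already projected into the common semantic space.
img = torch.randn(32, 256)   # batch of image representations
txt = torch.randn(32, 256)   # matching text representations
loss = info_nce(img, txt)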
About the journal:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically inspired artificial intelligence.