{"title":"基于双聚合的联合模态相似性哈希跨模态检索","authors":"Le Xu, Jun Yin","doi":"10.1016/j.neunet.2025.108069","DOIUrl":null,"url":null,"abstract":"<div><div>Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, realizing efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention for not needing external label information. However, in the field of unsupervised cross-modal hashing, there are several pressing issues to address: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data, thereby constructing a more reliable affinity matrix to assist in the learning of hash codes. In this paper, Dual Aggregation-Based Joint-modal Similarity Hashing (DAJSH) is proposed to overcome these challenges. To enhance cross-modal semantic alignment, we employ a Transformer encoder to fuse image and text features and introduce a contrastive loss to optimize cross-modal consistency. Additionally, for constructing a more reliable affinity matrix to assist hash code learning, we propose a dual-aggregation affinity matrix construction scheme. This scheme integrates intra-modal cosine similarity and Euclidean distance while incorporating cross-modal similarity, thereby maximally preserving cross-modal semantic information. Experimental results demonstrate that our method achieves performance improvements of 1.9 % <span><math><mo>∼</mo></math></span> 5.1 %, 0.9 % <span><math><mo>∼</mo></math></span> 5.8 % and 0.6 % <span><math><mo>∼</mo></math></span> 2.6 % over state-of-the-art approaches on the MIR Flickr, NUS-WIDE and MS COCO benchmark datasets, respectively.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"193 ","pages":"Article 108069"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dual aggregation based joint-modal similarity hashing for cross-modal retrieval\",\"authors\":\"Le Xu, Jun Yin\",\"doi\":\"10.1016/j.neunet.2025.108069\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, realizing efficient cross-modal retrieval. In particular, unsupervised cross-modal hashing methods attract significant attention for not needing external label information. However, in the field of unsupervised cross-modal hashing, there are several pressing issues to address: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data, thereby constructing a more reliable affinity matrix to assist in the learning of hash codes. In this paper, Dual Aggregation-Based Joint-modal Similarity Hashing (DAJSH) is proposed to overcome these challenges. To enhance cross-modal semantic alignment, we employ a Transformer encoder to fuse image and text features and introduce a contrastive loss to optimize cross-modal consistency. Additionally, for constructing a more reliable affinity matrix to assist hash code learning, we propose a dual-aggregation affinity matrix construction scheme. This scheme integrates intra-modal cosine similarity and Euclidean distance while incorporating cross-modal similarity, thereby maximally preserving cross-modal semantic information. 
Experimental results demonstrate that our method achieves performance improvements of 1.9 % <span><math><mo>∼</mo></math></span> 5.1 %, 0.9 % <span><math><mo>∼</mo></math></span> 5.8 % and 0.6 % <span><math><mo>∼</mo></math></span> 2.6 % over state-of-the-art approaches on the MIR Flickr, NUS-WIDE and MS COCO benchmark datasets, respectively.</div></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"193 \",\"pages\":\"Article 108069\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608025009499\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025009499","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Dual aggregation based joint-modal similarity hashing for cross-modal retrieval
Cross-modal hashing aims to leverage hashing functions to map multimodal data into a unified low-dimensional space, enabling efficient cross-modal retrieval. Unsupervised cross-modal hashing methods in particular have attracted significant attention because they do not require external label information. However, the field still faces several pressing issues: (1) how to facilitate semantic alignment between modalities, and (2) how to effectively capture the intrinsic relationships between data, thereby constructing a more reliable affinity matrix to assist the learning of hash codes. In this paper, Dual Aggregation-Based Joint-modal Similarity Hashing (DAJSH) is proposed to overcome these challenges. To enhance cross-modal semantic alignment, we employ a Transformer encoder to fuse image and text features and introduce a contrastive loss to optimize cross-modal consistency. Additionally, to construct a more reliable affinity matrix for hash code learning, we propose a dual-aggregation affinity matrix construction scheme. This scheme integrates intra-modal cosine similarity and Euclidean distance while incorporating cross-modal similarity, thereby maximally preserving cross-modal semantic information. Experimental results demonstrate that our method achieves performance improvements of 1.9 % ∼ 5.1 %, 0.9 % ∼ 5.8 % and 0.6 % ∼ 2.6 % over state-of-the-art approaches on the MIR Flickr, NUS-WIDE and MS COCO benchmark datasets, respectively.
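The abstract names two technical components: a contrastive loss on fused image-text features and a dual-aggregation affinity matrix. The following is a minimal PyTorch sketch of those two ideas only, not the authors' implementation; the function names, mixing weights alpha and beta, and the temperature are illustrative assumptions, and the inputs are assumed to be batches of image and text features produced by some backbone.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss that pulls matched image-text pairs together."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature              # (B, B) cross-modal similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def dual_aggregation_affinity(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                              alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Affinity matrix that aggregates intra-modal cosine and Euclidean-based
    similarities, then mixes in cross-modal similarity (illustrative weights)."""
    img = F.normalize(img_feat, dim=1)
    txt = F.normalize(txt_feat, dim=1)

    # Intra-modal cosine similarity.
    cos_img = img @ img.t()
    cos_txt = txt @ txt.t()

    # Intra-modal Euclidean distance, rescaled to [0, 1] and flipped into a similarity.
    def euclid_sim(x: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(x, x)
        return 1.0 - d / (d.max() + 1e-8)

    euc_img = euclid_sim(img)
    euc_txt = euclid_sim(txt)

    # First aggregation: combine the two intra-modal views within each modality.
    intra_img = alpha * cos_img + (1 - alpha) * euc_img
    intra_txt = alpha * cos_txt + (1 - alpha) * euc_txt

    # Cross-modal cosine similarity, symmetrised.
    cross = img @ txt.t()
    cross = 0.5 * (cross + cross.t())

    # Second aggregation: fuse intra-modal and cross-modal affinities.
    return beta * 0.5 * (intra_img + intra_txt) + (1 - beta) * cross
```

In the paper, the fused features come from a Transformer encoder and the exact aggregation scheme is DAJSH's own; the sketch above is only meant to make the two similarity notions (intra-modal cosine/Euclidean and cross-modal) concrete.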
Journal introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.