{"title":"Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency","authors":"Xiaoqing Liu;Huanqiang Zeng;Yifan Shi;Jianqing Zhu;Kaixiang Yang;Zhiwen Yu","doi":"10.1109/TMM.2025.3535378","DOIUrl":null,"url":null,"abstract":"In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, taking advantage of the inherent advantages of the multifaceted and dynamism inherent in multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and the inaccurate neighborhood correlations impacts the presentation of common semantics throughout the optimization. To address the aforementioned issues, we propose <bold>E</b>nsemble <bold>P</b>rototype <bold>Net</b>works (EPNet), which delineates class attributes of cross-modal instances through an ensemble clustering methodology. EPNet seeks to extract correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, the local correlation aggregation is first proposed within a batch of semantic affinity relationships to generate a precise and compact hash code among cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby aiding the model in extracting more universally applicable feature representations. Thirdly, an early attempt to constrict the representational congruity of local semantic affinity relationships and deep feature ensemble prototype correlations using cross-task consistency loss aims to enhance the representation of cross-modal common semantic features. Finally, EPNet outperforms several state-of-the-art cross-modal retrieval methods on three real-world image-text datasets in extensive experiments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3476-3488"},"PeriodicalIF":9.7000,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10855527/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, exploiting the multifaceted and dynamic nature of multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and inaccurate neighborhood correlations impair the representation of common semantics throughout the optimization. To address these issues, we propose Ensemble Prototype Networks (EPNet), which delineates the class attributes of cross-modal instances through an ensemble clustering methodology. EPNet extracts correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, local correlation aggregation is first applied within a batch of semantic affinity relationships to generate precise and compact hash codes for cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby helping the model extract more universally applicable feature representations. Thirdly, as an early attempt, a cross-task consistency loss is introduced to constrain the representational congruity between local semantic affinity relationships and deep-feature ensemble prototype correlations, enhancing the representation of cross-modal common semantic features. Finally, extensive experiments on three real-world image-text datasets show that EPNet outperforms several state-of-the-art cross-modal retrieval methods.
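To make the interplay between batch-level affinity signals and prototype-based supervision concrete, the PyTorch sketch below illustrates one plausible form of such a cross-task consistency loss: hash-code affinities within and across modalities are aligned with instance agreement in a prototype (soft cluster assignment) space. All module names, dimensions, the soft-assignment temperature, and the per-modality prototype sets are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of aligning batch affinity relationships with prototype-space
# correlations via a consistency loss. Hypothetical design choices throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HashHead(nn.Module):
    """Maps modality-specific features to continuous codes in (-1, 1)."""
    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, code_len), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)


def batch_affinity(features: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity matrix over a batch of features/codes."""
    f = F.normalize(features, dim=1)
    return f @ f.t()


def prototype_affinity(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Soft assignments to prototypes, then instance-instance agreement in that space."""
    logits = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t()
    assign = F.softmax(logits / 0.1, dim=1)   # 0.1 = assumed temperature
    return assign @ assign.t()


def cross_task_consistency_loss(img_feat, txt_feat, img_code, txt_code,
                                img_protos, txt_protos):
    """Align hash-code affinities (within and across modalities) with a shared
    prototype-space affinity target built from both modalities."""
    s_joint = 0.5 * (prototype_affinity(img_feat, img_protos)
                     + prototype_affinity(txt_feat, txt_protos))

    a_img = batch_affinity(img_code)
    a_txt = batch_affinity(txt_code)
    a_cross = F.normalize(img_code, dim=1) @ F.normalize(txt_code, dim=1).t()

    return (F.mse_loss(a_img, s_joint)
            + F.mse_loss(a_txt, s_joint)
            + F.mse_loss(a_cross, s_joint))


if __name__ == "__main__":
    # Toy sizes only; prototypes would normally come from ensemble clustering.
    B, D_IMG, D_TXT, CODE_LEN, K = 32, 4096, 1386, 64, 24
    img_feat, txt_feat = torch.randn(B, D_IMG), torch.randn(B, D_TXT)
    img_protos, txt_protos = torch.randn(K, D_IMG), torch.randn(K, D_TXT)

    img_head, txt_head = HashHead(D_IMG, CODE_LEN), HashHead(D_TXT, CODE_LEN)
    loss = cross_task_consistency_loss(img_feat, txt_feat,
                                       img_head(img_feat), txt_head(txt_feat),
                                       img_protos, txt_protos)
    print(f"toy consistency loss: {loss.item():.4f}")
```

In this sketch the prototype-space agreement acts as the supervisory signal for the continuous hash codes, which mirrors the abstract's idea of constraining hash-code affinities with ensemble prototype correlations rather than with the initial pre-trained feature neighborhoods alone.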
Journal introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.