{"title":"Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency","authors":"Xiaoqing Liu;Huanqiang Zeng;Yifan Shi;Jianqing Zhu;Kaixiang Yang;Zhiwen Yu","doi":"10.1109/TMM.2025.3535378","DOIUrl":null,"url":null,"abstract":"In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, taking advantage of the inherent advantages of the multifaceted and dynamism inherent in multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and the inaccurate neighborhood correlations impacts the presentation of common semantics throughout the optimization. To address the aforementioned issues, we propose <bold>E</b>nsemble <bold>P</b>rototype <bold>Net</b>works (EPNet), which delineates class attributes of cross-modal instances through an ensemble clustering methodology. EPNet seeks to extract correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, the local correlation aggregation is first proposed within a batch of semantic affinity relationships to generate a precise and compact hash code among cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby aiding the model in extracting more universally applicable feature representations. Thirdly, an early attempt to constrict the representational congruity of local semantic affinity relationships and deep feature ensemble prototype correlations using cross-task consistency loss aims to enhance the representation of cross-modal common semantic features. Finally, EPNet outperforms several state-of-the-art cross-modal retrieval methods on three real-world image-text datasets in extensive experiments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3476-3488"},"PeriodicalIF":9.7000,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10855527/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, exploiting the multifaceted and dynamic nature of multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and inaccurate neighborhood correlations impair the representation of common semantics throughout the optimization. To address these issues, we propose Ensemble Prototype Networks (EPNet), which delineates the class attributes of cross-modal instances through an ensemble clustering methodology. EPNet extracts correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, local correlation aggregation is first applied within a batch of semantic affinity relationships to generate precise and compact hash codes for cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby helping the model extract more universally applicable feature representations. Thirdly, as an early attempt, a cross-task consistency loss is introduced to constrain the representational congruity between local semantic affinity relationships and deep-feature ensemble prototype correlations, enhancing the representation of cross-modal common semantic features. Finally, extensive experiments on three real-world image-text datasets show that EPNet outperforms several state-of-the-art cross-modal retrieval methods.
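To make the interplay between batch-level affinity signals and prototype-based supervision concrete, the PyTorch sketch below illustrates one plausible form of such a cross-task consistency loss: hash-code affinities within and across modalities are aligned with instance agreement in a prototype (soft cluster assignment) space. All module names, dimensions, the soft-assignment temperature, and the per-modality prototype sets are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of aligning batch affinity relationships with prototype-space
# correlations via a consistency loss. Hypothetical design choices throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HashHead(nn.Module):
    """Maps modality-specific features to continuous codes in (-1, 1)."""
    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, code_len), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)


def batch_affinity(features: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity matrix over a batch of features/codes."""
    f = F.normalize(features, dim=1)
    return f @ f.t()


def prototype_affinity(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Soft assignments to prototypes, then instance-instance agreement in that space."""
    logits = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t()
    assign = F.softmax(logits / 0.1, dim=1)   # 0.1 = assumed temperature
    return assign @ assign.t()


def cross_task_consistency_loss(img_feat, txt_feat, img_code, txt_code,
                                img_protos, txt_protos):
    """Align hash-code affinities (within and across modalities) with a shared
    prototype-space affinity target built from both modalities."""
    s_joint = 0.5 * (prototype_affinity(img_feat, img_protos)
                     + prototype_affinity(txt_feat, txt_protos))

    a_img = batch_affinity(img_code)
    a_txt = batch_affinity(txt_code)
    a_cross = F.normalize(img_code, dim=1) @ F.normalize(txt_code, dim=1).t()

    return (F.mse_loss(a_img, s_joint)
            + F.mse_loss(a_txt, s_joint)
            + F.mse_loss(a_cross, s_joint))


if __name__ == "__main__":
    # Toy sizes only; prototypes would normally come from ensemble clustering.
    B, D_IMG, D_TXT, CODE_LEN, K = 32, 4096, 1386, 64, 24
    img_feat, txt_feat = torch.randn(B, D_IMG), torch.randn(B, D_TXT)
    img_protos, txt_protos = torch.randn(K, D_IMG), torch.randn(K, D_TXT)

    img_head, txt_head = HashHead(D_IMG, CODE_LEN), HashHead(D_TXT, CODE_LEN)
    loss = cross_task_consistency_loss(img_feat, txt_feat,
                                       img_head(img_feat), txt_head(txt_feat),
                                       img_protos, txt_protos)
    print(f"toy consistency loss: {loss.item():.4f}")
```

In this sketch the prototype-space agreement acts as the supervisory signal for the continuous hash codes, which mirrors the abstract's idea of constraining hash-code affinities with ensemble prototype correlations rather than with the initial pre-trained feature neighborhoods alone.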
Journal introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.