Modality-specific adaptive scaling and attention network for cross-modal retrieval

IF 5.5 | CAS Region 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xiao Ke, Baitao Chen, Yuhang Cai, Hao Liu, Wenzhong Guo, Weibin Chen
{"title":"Modality-specific adaptive scaling and attention network for cross-modal retrieval","authors":"Xiao Ke,&nbsp;Baitao Chen,&nbsp;Yuhang Cai,&nbsp;Hao Liu,&nbsp;Wenzhong Guo,&nbsp;Weibin Chen","doi":"10.1016/j.neucom.2024.128664","DOIUrl":null,"url":null,"abstract":"<div><div>There are huge differences in data distribution and feature representation of different modalities. How to flexibly and accurately retrieve data from different modalities is a challenging problem. The mainstream common subspace methods only focus on the heterogeneity gap, and use a unified method to jointly learn the common representation of different modalities, which can easily lead to the difficulty of multi-modal unified fitting. In this work, we innovatively propose the concept of multi-modal information density discrepancy, and propose a modality-specific adaptive scaling method incorporating prior knowledge, which can adaptively learn the most suitable network for different modalities. Secondly, for the problem of efficient semantic fusion and interference features, we propose a multi-level modal feature attention mechanism, which realizes the efficient fusion of text semantics through attention mechanism, explicitly captures and shields the interference features from multiple scales. In addition, to address the bottleneck of cross-modal retrieval task caused by the insufficient quality of multimodal common subspace and the defects of Transformer structure, this paper proposes a cross-level interaction injection mechanism to fuse multi-level patch interactions without affecting the pre-trained model to construct higher quality latent representation spaces and multimodal common subspaces. Comprehensive experimental results on four widely used cross-modal retrieval datasets show the proposed MASAN achieves the state-of-the-art results and significantly outperforms other existing methods.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224014358","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Data distribution and feature representation differ greatly across modalities, which makes flexible and accurate retrieval across modalities a challenging problem. Mainstream common-subspace methods focus only on the heterogeneity gap and use a single unified network to jointly learn a common representation for different modalities, which can easily make a unified multi-modal fit difficult to achieve. In this work, we introduce the concept of multi-modal information density discrepancy and propose a modality-specific adaptive scaling method that incorporates prior knowledge and adaptively learns the most suitable network for each modality. Second, to address efficient semantic fusion and interference features, we propose a multi-level modal feature attention mechanism that fuses text semantics efficiently through attention while explicitly capturing and shielding interference features at multiple scales. In addition, to address the bottleneck in cross-modal retrieval caused by the insufficient quality of the multimodal common subspace and the structural limitations of the Transformer, this paper proposes a cross-level interaction injection mechanism that fuses multi-level patch interactions without disturbing the pre-trained model, constructing higher-quality latent representation spaces and multimodal common subspaces. Comprehensive experiments on four widely used cross-modal retrieval datasets show that the proposed MASAN achieves state-of-the-art results and significantly outperforms existing methods.
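Since the abstract only names its mechanisms, a minimal sketch may help make the adaptive-scaling idea concrete. The sketch below is a hypothetical illustration, not the paper's MASAN implementation: the class name ModalityAdaptiveScale, the softmax-weighted mixing of features from several network depths, and all dimensions are assumptions chosen for the example. It shows one plausible way a per-modality module could re-weight multi-scale features, in the spirit of adapting the network to each modality's information density.

```python
# Hypothetical sketch only -- illustrative of the adaptive-scaling idea,
# NOT the MASAN architecture described in the paper.
import torch
import torch.nn as nn


class ModalityAdaptiveScale(nn.Module):
    """Learns a per-modality gate that re-weights features taken from
    several network depths, so each modality can emphasize the scales
    that best match its information density (an assumed reading of the
    abstract's 'modality-specific adaptive scaling')."""

    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        # One learnable logit per feature level; softmax gives mixing weights.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))
        self.proj = nn.Linear(dim, dim)

    def forward(self, level_feats: list) -> torch.Tensor:
        # level_feats: list of (batch, dim) tensors from different depths.
        weights = torch.softmax(self.level_logits, dim=0)
        fused = sum(w * f for w, f in zip(weights, level_feats))
        return self.proj(fused)


if __name__ == "__main__":
    # Separate instances per modality: each learns its own level weighting.
    image_scale = ModalityAdaptiveScale(dim=512, num_levels=3)
    text_scale = ModalityAdaptiveScale(dim=512, num_levels=3)
    feats = [torch.randn(8, 512) for _ in range(3)]
    common = image_scale(feats)  # (8, 512) embedding toward a shared space
    print(common.shape)
```

The design choice to keep one module per modality (rather than a shared one) mirrors the abstract's claim that a single unified network fits all modalities poorly; the actual paper should be consulted for the real mechanism.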
Source journal: Neurocomputing (Engineering & Technology — Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
Aims & scope: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.