Modality-specific adaptive scaling and attention network for cross-modal retrieval

IF 5.5 | CAS Region 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xiao Ke, Baitao Chen, Yuhang Cai, Hao Liu, Wenzhong Guo, Weibin Chen
{"title":"Modality-specific adaptive scaling and attention network for cross-modal retrieval","authors":"Xiao Ke,&nbsp;Baitao Chen,&nbsp;Yuhang Cai,&nbsp;Hao Liu,&nbsp;Wenzhong Guo,&nbsp;Weibin Chen","doi":"10.1016/j.neucom.2024.128664","DOIUrl":null,"url":null,"abstract":"<div><div>There are huge differences in data distribution and feature representation of different modalities. How to flexibly and accurately retrieve data from different modalities is a challenging problem. The mainstream common subspace methods only focus on the heterogeneity gap, and use a unified method to jointly learn the common representation of different modalities, which can easily lead to the difficulty of multi-modal unified fitting. In this work, we innovatively propose the concept of multi-modal information density discrepancy, and propose a modality-specific adaptive scaling method incorporating prior knowledge, which can adaptively learn the most suitable network for different modalities. Secondly, for the problem of efficient semantic fusion and interference features, we propose a multi-level modal feature attention mechanism, which realizes the efficient fusion of text semantics through attention mechanism, explicitly captures and shields the interference features from multiple scales. In addition, to address the bottleneck of cross-modal retrieval task caused by the insufficient quality of multimodal common subspace and the defects of Transformer structure, this paper proposes a cross-level interaction injection mechanism to fuse multi-level patch interactions without affecting the pre-trained model to construct higher quality latent representation spaces and multimodal common subspaces. Comprehensive experimental results on four widely used cross-modal retrieval datasets show the proposed MASAN achieves the state-of-the-art results and significantly outperforms other existing methods.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224014358","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Data distribution and feature representation differ greatly across modalities, which makes flexible and accurate retrieval across modalities a challenging problem. Mainstream common-subspace methods focus only on the heterogeneity gap and use a single unified network to jointly learn a common representation for different modalities, which can easily make a unified multi-modal fit difficult to achieve. In this work, we introduce the concept of multi-modal information density discrepancy and propose a modality-specific adaptive scaling method that incorporates prior knowledge and adaptively learns the most suitable network for each modality. Second, to address efficient semantic fusion and interference features, we propose a multi-level modal feature attention mechanism that fuses text semantics efficiently through attention while explicitly capturing and shielding interference features at multiple scales. In addition, to address the bottleneck in cross-modal retrieval caused by the insufficient quality of the multimodal common subspace and the structural limitations of the Transformer, this paper proposes a cross-level interaction injection mechanism that fuses multi-level patch interactions without disturbing the pre-trained model, constructing higher-quality latent representation spaces and multimodal common subspaces. Comprehensive experiments on four widely used cross-modal retrieval datasets show that the proposed MASAN achieves state-of-the-art results and significantly outperforms existing methods.
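Since the abstract only names its mechanisms, a minimal sketch may help make the adaptive-scaling idea concrete. The sketch below is a hypothetical illustration, not the paper's MASAN implementation: the class name ModalityAdaptiveScale, the softmax-weighted mixing of features from several network depths, and all dimensions are assumptions chosen for the example. It shows one plausible way a per-modality module could re-weight multi-scale features, in the spirit of adapting the network to each modality's information density.

```python
# Hypothetical sketch only -- illustrative of the adaptive-scaling idea,
# NOT the MASAN architecture described in the paper.
import torch
import torch.nn as nn


class ModalityAdaptiveScale(nn.Module):
    """Learns a per-modality gate that re-weights features taken from
    several network depths, so each modality can emphasize the scales
    that best match its information density (an assumed reading of the
    abstract's 'modality-specific adaptive scaling')."""

    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        # One learnable logit per feature level; softmax gives mixing weights.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))
        self.proj = nn.Linear(dim, dim)

    def forward(self, level_feats: list) -> torch.Tensor:
        # level_feats: list of (batch, dim) tensors from different depths.
        weights = torch.softmax(self.level_logits, dim=0)
        fused = sum(w * f for w, f in zip(weights, level_feats))
        return self.proj(fused)


if __name__ == "__main__":
    # Separate instances per modality: each learns its own level weighting.
    image_scale = ModalityAdaptiveScale(dim=512, num_levels=3)
    text_scale = ModalityAdaptiveScale(dim=512, num_levels=3)
    feats = [torch.randn(8, 512) for _ in range(3)]
    common = image_scale(feats)  # (8, 512) embedding toward a shared space
    print(common.shape)
```

The design choice to keep one module per modality (rather than a shared one) mirrors the abstract's claim that a single unified network fits all modalities poorly; the actual paper should be consulted for the real mechanism.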
Source journal: Neurocomputing (Engineering & Technology — Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
Aims & scope: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.