{"title":"用于跨模态检索的特定模态自适应缩放和注意力网络","authors":"Xiao Ke, Baitao Chen, Yuhang Cai, Hao Liu, Wenzhong Guo, Weibin Chen","doi":"10.1016/j.neucom.2024.128664","DOIUrl":null,"url":null,"abstract":"<div><div>There are huge differences in data distribution and feature representation of different modalities. How to flexibly and accurately retrieve data from different modalities is a challenging problem. The mainstream common subspace methods only focus on the heterogeneity gap, and use a unified method to jointly learn the common representation of different modalities, which can easily lead to the difficulty of multi-modal unified fitting. In this work, we innovatively propose the concept of multi-modal information density discrepancy, and propose a modality-specific adaptive scaling method incorporating prior knowledge, which can adaptively learn the most suitable network for different modalities. Secondly, for the problem of efficient semantic fusion and interference features, we propose a multi-level modal feature attention mechanism, which realizes the efficient fusion of text semantics through attention mechanism, explicitly captures and shields the interference features from multiple scales. In addition, to address the bottleneck of cross-modal retrieval task caused by the insufficient quality of multimodal common subspace and the defects of Transformer structure, this paper proposes a cross-level interaction injection mechanism to fuse multi-level patch interactions without affecting the pre-trained model to construct higher quality latent representation spaces and multimodal common subspaces. Comprehensive experimental results on four widely used cross-modal retrieval datasets show the proposed MASAN achieves the state-of-the-art results and significantly outperforms other existing methods.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Modality-specific adaptive scaling and attention network for cross-modal retrieval\",\"authors\":\"Xiao Ke, Baitao Chen, Yuhang Cai, Hao Liu, Wenzhong Guo, Weibin Chen\",\"doi\":\"10.1016/j.neucom.2024.128664\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>There are huge differences in data distribution and feature representation of different modalities. How to flexibly and accurately retrieve data from different modalities is a challenging problem. The mainstream common subspace methods only focus on the heterogeneity gap, and use a unified method to jointly learn the common representation of different modalities, which can easily lead to the difficulty of multi-modal unified fitting. In this work, we innovatively propose the concept of multi-modal information density discrepancy, and propose a modality-specific adaptive scaling method incorporating prior knowledge, which can adaptively learn the most suitable network for different modalities. Secondly, for the problem of efficient semantic fusion and interference features, we propose a multi-level modal feature attention mechanism, which realizes the efficient fusion of text semantics through attention mechanism, explicitly captures and shields the interference features from multiple scales. 
In addition, to address the bottleneck of cross-modal retrieval task caused by the insufficient quality of multimodal common subspace and the defects of Transformer structure, this paper proposes a cross-level interaction injection mechanism to fuse multi-level patch interactions without affecting the pre-trained model to construct higher quality latent representation spaces and multimodal common subspaces. Comprehensive experimental results on four widely used cross-modal retrieval datasets show the proposed MASAN achieves the state-of-the-art results and significantly outperforms other existing methods.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-10-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224014358\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224014358","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Modality-specific adaptive scaling and attention network for cross-modal retrieval
Abstract: Different modalities differ greatly in data distribution and feature representation, which makes flexible and accurate retrieval across modalities a challenging problem. Mainstream common-subspace methods focus only on the heterogeneity gap and jointly learn a common representation for all modalities with a single unified network, which makes it difficult to fit every modality well. In this work, we introduce the concept of multi-modal information density discrepancy and propose a modality-specific adaptive scaling method that incorporates prior knowledge to adaptively learn the most suitable network for each modality. Second, to achieve efficient semantic fusion while suppressing interfering features, we propose a multi-level modal feature attention mechanism that fuses text semantics through attention and explicitly captures and shields interfering features at multiple scales. In addition, to address the bottleneck caused by the limited quality of the multimodal common subspace and by structural shortcomings of the Transformer, we propose a cross-level interaction injection mechanism that fuses multi-level patch interactions without modifying the pre-trained model, yielding higher-quality latent representation spaces and multimodal common subspaces. Comprehensive experiments on four widely used cross-modal retrieval datasets show that the proposed MASAN achieves state-of-the-art results and significantly outperforms existing methods.
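To make the two core ideas named in the abstract more concrete, below is a minimal PyTorch-style sketch of (a) a per-modality adaptive scaling gate over a stack of sub-networks and (b) attention-based fusion of multi-level features with soft masking of interfering ones. The abstract does not disclose MASAN's actual implementation, so all module names, dimensions, and the specific gating and masking formulations here are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: the abstract gives no implementation details,
# so every design choice below is a hypothetical reading of the described
# mechanisms, not MASAN's published architecture.
import torch
import torch.nn as nn


class ModalitySpecificScaling(nn.Module):
    """Learns a per-modality gate over a stack of sub-blocks, so each
    modality can adaptively weight how much network capacity it uses."""

    def __init__(self, dim: int, num_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)]
        )
        # One learnable gate per block; prior knowledge could initialize
        # these (e.g., favoring deeper stacks for the denser modality).
        self.gates = nn.Parameter(torch.ones(num_blocks))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gates, dim=0)
        out = x
        for w, block in zip(weights, self.blocks):
            out = out + w * block(out)  # gated residual update
        return out


class MultiLevelAttentionFusion(nn.Module):
    """Attends over features collected from several network levels and
    down-weights (shields) low-relevance, interfering levels before fusion."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-level relevance score

    def forward(self, levels: torch.Tensor) -> torch.Tensor:
        # levels: (batch, num_levels, dim), one feature vector per level
        fused, _ = self.attn(levels, levels, levels)
        relevance = torch.sigmoid(self.score(fused))  # soft mask in [0, 1]
        return (relevance * fused).mean(dim=1)  # suppress low-relevance levels


if __name__ == "__main__":
    feats = torch.randn(2, 4, 256)  # hypothetical: 4 feature levels, dim 256
    scaled = ModalitySpecificScaling(256)(feats)
    common = MultiLevelAttentionFusion(256)(scaled)
    print(common.shape)  # torch.Size([2, 256])
```

The gated-residual formulation is one plausible way to realize "adaptively learn the most suitable network for different modalities"; the paper itself may use a different scaling parameterization.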
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.