Mohammad Zia Ur Rehman, Devraj Raghuvanshi, Umang Jain, Shubhi Bansal, Nagendra Kumar
{"title":"一种具有跨模态关系和分层交互关注的语义理解多模态多任务框架","authors":"Mohammad Zia Ur Rehman , Devraj Raghuvanshi , Umang Jain , Shubhi Bansal , Nagendra Kumar","doi":"10.1016/j.inffus.2025.103628","DOIUrl":null,"url":null,"abstract":"<div><div>A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, the multimodal fusion techniques while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks. The code is available in the GitHub repository. <span><span>https://github.com/devraj-raghuvanshi/MM-ORIENT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"126 ","pages":"Article 103628"},"PeriodicalIF":15.5000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A multimodal–multitask framework with cross-modal relation and hierarchical interactive attention for semantic comprehension\",\"authors\":\"Mohammad Zia Ur Rehman , Devraj Raghuvanshi , Umang Jain , Shubhi Bansal , Nagendra Kumar\",\"doi\":\"10.1016/j.inffus.2025.103628\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, the multimodal fusion techniques while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. 
The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks. The code is available in the GitHub repository. <span><span>https://github.com/devraj-raghuvanshi/MM-ORIENT</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"126 \",\"pages\":\"Article 103628\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525007006\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525007006","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A multimodal–multitask framework with cross-modal relation and hierarchical interactive attention for semantic comprehension
A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, multimodal fusion techniques, while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally, without explicit interaction between different modalities, reducing the effect of noise at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is determined by the features of the other modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA aids multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks. The code is available in the GitHub repository: https://github.com/devraj-raghuvanshi/MM-ORIENT.
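To make the cross-modal relation graph idea concrete, below is a minimal NumPy sketch of one plausible reading of the abstract: the neighborhood of each node in one modality is chosen by measuring similarity through the other modality's features, and the monomodal features are then reconstructed by aggregating over that neighborhood. The function names, the k-nearest-neighbor construction, and the assumption that both modalities are projected to a shared feature dimension are illustrative assumptions, not the authors' implementation (see the repository linked above for the actual code).

```python
import numpy as np

def cross_modal_adjacency(anchor_feats, other_feats, k=4):
    """Build a k-NN adjacency over `anchor_feats` nodes, where similarity
    between two anchor nodes is measured through their affinities to the
    other modality's features (one possible reading of a cross-modal
    relation graph; the paper's exact construction may differ)."""
    # Cosine-normalize both modalities (assumed projected to the same dim).
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    o = other_feats / np.linalg.norm(other_feats, axis=1, keepdims=True)
    cross_sim = a @ o.T                 # (n_anchor, n_other) cross-modal affinities
    node_sim = cross_sim @ cross_sim.T  # anchor-anchor similarity via the other modality
    adj = np.zeros_like(node_sim)
    for i, row in enumerate(node_sim):
        nbrs = np.argsort(row)[-k:]     # k most related anchor nodes
        adj[i, nbrs] = 1.0
    # Row-normalize so aggregation averages over the chosen neighborhood.
    deg = adj.sum(axis=1, keepdims=True) + 1e-8
    return adj / deg

def reconstruct_features(feats, adj):
    """Reconstruct monomodal features with one step of propagation over
    the cross-modally chosen neighborhood."""
    return adj @ feats

# Hypothetical usage with random stand-ins for text/image token features.
text_feats = np.random.randn(12, 256)   # e.g., 12 text tokens
image_feats = np.random.randn(49, 256)  # e.g., 49 image patches
adj = cross_modal_adjacency(text_feats, image_feats, k=4)
text_recon = reconstruct_features(text_feats, adj)
```

In this sketch, no text feature is ever mixed directly with an image feature; the image modality only decides which text nodes aggregate with which, which is one way to obtain multimodal awareness without explicit cross-modal feature interaction.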
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.