Enhancing Multimodal Learning via Hierarchical Fusion Architecture Search With Inconsistency Mitigation

Impact factor: 13.7

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society

Pub Date: 2025-08-22

DOI: 10.1109/TIP.2025.3599673

Kaifang Long; Guoyang Xie; Lianbo Ma; Qing Li; Min Huang; Jianhui Lv; Zhichao Lu

Volume 34, pages 5458-5472

Journal Article

https://ieeexplore.ieee.org/document/11134693/

Citations: 0

Abstract

Designing effective multimodal feature fusion strategies is a key task in multimodal learning, and it often demands substantial computation and extensive expertise. In this paper, we seek to enhance multimodal learning via hierarchical fusion architecture search with inconsistency mitigation. Unlike previous work, our Hierarchical Fusion Multimodal Neural Architecture Search (HF-MNAS) accounts for the inconsistency between modalities and labels and exploits multi-level fusion architectures in a fine-grained way. Specifically, it decomposes the hierarchical fusion problem into two search spaces, one macro-level and one micro-level. In the macro-level search space, high-level and low-level features are extracted and then connected in a fine-grained way, and an inconsistency mitigation module is designed to minimize discrepancies between modalities and labels in cell outputs. In the micro-level search space, we find that different intermediate nodes in the cells differ in importance, so we propose an importance-based node selection mechanism to form the optimal cells for feature fusion. We evaluate HF-MNAS on a series of multimodal classification tasks. Empirical evidence shows that HF-MNAS achieves a competitive trade-off among accuracy, search time, and inference speed. In particular, HF-MNAS has the lowest computational cost among the compared state-of-the-art MNAS methods. Furthermore, we verify theoretically and experimentally that modality-label inconsistency degrades overall fusion performance, as measured by metrics such as accuracy and F1 score, and demonstrate that the proposed inconsistency mitigation module can effectively mitigate this degradation.
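The abstract does not say how node importance is computed or how many nodes are kept, so the following is only a minimal sketch of what an importance-based selection step could look like. It assumes the importance scores are already available as plain numbers; the function name `select_important_nodes`, the node names, the scores, and the `keep` parameter are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def select_important_nodes(importance: Dict[str, float], keep: int) -> List[str]:
    """Return the `keep` highest-scoring node names, preserving their original order.

    `importance` maps a node name to an importance score (assumed to be given;
    how HF-MNAS derives such scores is not stated in the abstract).
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Rank node names by score, highest first; ties keep insertion order
    # because Python's sort is stable.
    ranked = sorted(importance, key=importance.get, reverse=True)[:keep]
    kept = set(ranked)
    # Emit the surviving nodes in the order they appear in the cell.
    return [name for name in importance if name in kept]


# Hypothetical scores for four intermediate nodes of one fusion cell.
scores = {"node0": 0.10, "node1": 0.45, "node2": 0.05, "node3": 0.40}
selected = select_important_nodes(scores, keep=2)
print(selected)
```

With these made-up scores, the two nodes kept are `node1` and `node3`. Preserving the original node order is a design choice of this sketch, made so that a pruned cell would keep its topological ordering; the paper may handle this differently.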


Source journal

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society

Self-citation rate

0.00%

Number of publications