Alex Munyole Luvembe, Weimin Li, Shaohua Li, Guiqiong Xu, Xing Wu, Fangfang Liu
Title: An adaptive auto fusion with hierarchical attention for multimodal fake news detection
Journal: Expert Systems with Applications, Volume 285, Article 127930 (Q1, Computer Science, Artificial Intelligence; Impact Factor 7.5)
DOI: 10.1016/j.eswa.2025.127930
Published: 2025-05-08
URL: https://www.sciencedirect.com/science/article/pii/S0957417425015520
Citations: 0
Abstract
The phenomenon of fake news often relies on diverse multimodal evidence to deceive readers and achieve widespread popularity. While existing fusion methods aim to enhance feature interaction, they typically rely on concatenation or attention mechanisms that struggle to model nuanced dynamics of multimodal information due to missing data and modality heterogeneity. To overcome these limitations, we propose an Adaptive Auto Fusion with Hierarchical Attention (AAFHA) framework for multimodal fake news detection. AAFHA integrates image captions directly into the fusion pipeline to strengthen cross-modal learning, unlike prior approaches that treat them as siloed inputs. We first design a multi-level interaction for text and captions by incorporating hierarchical encoding to capture both local and global dependencies, allowing the model to detect subtle cross-modal associations. Then, a sparse weighting technique, guided by hierarchical attention, further refines these interactions by dynamically allocating attention across modalities. This guided focus is implemented through a constrained SoftMax function, improving contextual alignment and reducing isolated feature modeling. To enable adaptive semantic integration, we introduce an Auto-Fusion module that supports dynamic end-to-end training. The model optimizes a learned similarity measure in a shared representation space, aligning textual, caption, and image features to adaptively capture semantic associations. Additionally, sparse training with contrastive loss is incorporated to preserve semantic consistency and enhance class separability during fusion. Experimental results demonstrate that AAFHA outperforms existing baselines, yielding accuracy improvements of 0.094%, 0.198%, and 0.001% on the PolitiFact, Gossip, and Pheme datasets, respectively. These findings demonstrate the model’s effectiveness in identifying multimodal fake news.
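The abstract's "sparse weighting technique … implemented through a constrained SoftMax function" can be illustrated with a minimal sketch. The paper does not specify the exact constraint, so the top-k masking below (keep the k strongest modality scores, zero out the rest) is an assumption standing in for whatever constraint AAFHA actually uses; the modality ordering (text, caption, image) and the scores are likewise hypothetical.

```python
import numpy as np

def constrained_softmax(scores, k=2):
    """Top-k constrained softmax over modality relevance scores.

    Keeps only the k largest scores and renormalizes, so attention is
    sparsely allocated across modalities rather than spread over all of
    them. This is an illustrative stand-in for AAFHA's constrained
    SoftMax, whose exact form is not given in the abstract.
    """
    scores = np.asarray(scores, dtype=float)
    masked = np.full_like(scores, -np.inf)   # dropped modalities get zero weight
    top = np.argsort(scores)[-k:]            # indices of the k largest scores
    masked[top] = scores[top]
    e = np.exp(masked - masked[top].max())   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical relevance scores for [text, caption, image]:
weights = constrained_softmax([2.0, 1.5, 0.1], k=2)
```

With `k=2`, the weakest modality receives exactly zero weight while the remaining weights still sum to one, which matches the abstract's goal of "reducing isolated feature modeling" by concentrating attention where cross-modal signal is strongest.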
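The "sparse training with contrastive loss … to preserve semantic consistency and enhance class separability" can likewise be sketched. The abstract does not give the loss formulation, so the standard InfoNCE-style objective below, which treats matched text/image pairs in a batch as positives and all other pairings as negatives, is an assumption, not the paper's actual loss; the embedding shapes and temperature are illustrative.

```python
import numpy as np

def contrastive_loss(text_emb, image_emb, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired embeddings.

    Row i of text_emb and row i of image_emb are assumed to come from the
    same news item (a positive pair); every other pairing in the batch is
    a negative. Minimizing this pulls matched cross-modal features
    together in the shared space, a generic proxy for AAFHA's loss.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature                  # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # diagonal = matched pairs

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

The loss is non-negative and shrinks as matched pairs become more similar than mismatched ones, which is how a contrastive term can enforce the semantic consistency the abstract describes.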
Journal overview:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.