Improving deepfake detection with predictive inter-modal alignment and feature reconstruction in audio–visual asynchrony scenarios

Impact Factor 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yan Wang, Qindong Sun, Jingpeng Zhang, Dongzhu Rong, Chao Shen, Xiaoxiong Wang
{"title":"Improving deepfake detection with predictive inter-modal alignment and feature reconstruction in audio–visual asynchrony scenarios","authors":"Yan Wang ,&nbsp;Qindong Sun ,&nbsp;Jingpeng Zhang ,&nbsp;Dongzhu Rong ,&nbsp;Chao Shen ,&nbsp;Xiaoxiong Wang","doi":"10.1016/j.inffus.2025.103708","DOIUrl":null,"url":null,"abstract":"<div><div>Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103708"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525007808","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.
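The abstract does not include implementation details. As a rough illustration of the alignment step it describes, the sketch below pairs asynchronous audio and visual frame features by solving an assignment problem with the Hungarian algorithm (via scipy.optimize.linear_sum_assignment). The cosine-similarity cost and the feature shapes are assumptions for illustration, not the authors' actual formulation, which also involves subspace representations and masked feature reconstruction not shown here.

```python
# Illustrative sketch only: pair audio and visual frame features under
# asynchrony by solving a minimum-cost assignment with the Hungarian algorithm.
# The cosine-similarity cost and feature dimensions are assumptions; the
# paper's cost design and subspace representation module are not reproduced.
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_audio_visual(audio_feats: np.ndarray, visual_feats: np.ndarray):
    """Return index pairs (i, j) matching audio frame i to visual frame j.

    audio_feats:  (Ta, d) array of per-frame audio embeddings.
    visual_feats: (Tv, d) array of per-frame visual embeddings.
    """
    # Normalize so the dot product equals cosine similarity.
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)

    # Cost = 1 - cosine similarity; lower cost means a better audio-visual match.
    cost = 1.0 - a @ v.T

    # Hungarian algorithm: minimum-cost one-to-one assignment
    # (rectangular cost matrices are supported).
    row_ind, col_ind = linear_sum_assignment(cost)
    return list(zip(row_ind.tolist(), col_ind.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(10, 64))   # 10 audio frames, 64-dim features
    visual = rng.normal(size=(12, 64))  # 12 visual frames (asynchronous stream)
    print(align_audio_visual(audio, visual)[:5])
```

In the paper's pipeline, the recovered pairings would then feed the self-supervised masked reconstruction and the joint correlation matrix; those stages are not sketched here.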
Source Journal

Information Fusion (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Average review time: 7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.