Improving deepfake detection with predictive inter-modal alignment and feature reconstruction in audio–visual asynchrony scenarios

Impact Factor 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yan Wang, Qindong Sun, Jingpeng Zhang, Dongzhu Rong, Chao Shen, Xiaoxiong Wang
{"title":"Improving deepfake detection with predictive inter-modal alignment and feature reconstruction in audio–visual asynchrony scenarios","authors":"Yan Wang ,&nbsp;Qindong Sun ,&nbsp;Jingpeng Zhang ,&nbsp;Dongzhu Rong ,&nbsp;Chao Shen ,&nbsp;Xiaoxiong Wang","doi":"10.1016/j.inffus.2025.103708","DOIUrl":null,"url":null,"abstract":"<div><div>Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103708"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525007808","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.
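The abstract does not include implementation details. As a rough illustration of the alignment step it describes, the sketch below pairs asynchronous audio and visual frame features by solving an assignment problem with the Hungarian algorithm (via scipy.optimize.linear_sum_assignment). The cosine-similarity cost and the feature shapes are assumptions for illustration, not the authors' actual formulation, which also involves subspace representations and masked feature reconstruction not shown here.

```python
# Illustrative sketch only: pair audio and visual frame features under
# asynchrony by solving a minimum-cost assignment with the Hungarian algorithm.
# The cosine-similarity cost and feature dimensions are assumptions; the
# paper's cost design and subspace representation module are not reproduced.
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_audio_visual(audio_feats: np.ndarray, visual_feats: np.ndarray):
    """Return index pairs (i, j) matching audio frame i to visual frame j.

    audio_feats:  (Ta, d) array of per-frame audio embeddings.
    visual_feats: (Tv, d) array of per-frame visual embeddings.
    """
    # Normalize so the dot product equals cosine similarity.
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)

    # Cost = 1 - cosine similarity; lower cost means a better audio-visual match.
    cost = 1.0 - a @ v.T

    # Hungarian algorithm: minimum-cost one-to-one assignment
    # (rectangular cost matrices are supported).
    row_ind, col_ind = linear_sum_assignment(cost)
    return list(zip(row_ind.tolist(), col_ind.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(10, 64))   # 10 audio frames, 64-dim features
    visual = rng.normal(size=(12, 64))  # 12 visual frames (asynchronous stream)
    print(align_audio_visual(audio, visual)[:5])
```

In the paper's pipeline, the recovered pairings would then feed the self-supervised masked reconstruction and the joint correlation matrix; those stages are not sketched here.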
Source Journal

Information Fusion (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Average review time: 7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.