Yan Wang, Qindong Sun, Jingpeng Zhang, Dongzhu Rong, Chao Shen, Xiaoxiong Wang
{"title":"在视听异步场景下,利用预测的多模态对齐和特征重建改进深度伪造检测","authors":"Yan Wang , Qindong Sun , Jingpeng Zhang , Dongzhu Rong , Chao Shen , Xiaoxiong Wang","doi":"10.1016/j.inffus.2025.103708","DOIUrl":null,"url":null,"abstract":"<div><div>Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103708"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving deepfake detection with predictive inter-modal alignment and feature reconstruction in audio–visual asynchrony scenarios\",\"authors\":\"Yan Wang , Qindong Sun , Jingpeng Zhang , Dongzhu Rong , Chao Shen , Xiaoxiong Wang\",\"doi\":\"10.1016/j.inffus.2025.103708\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. 
Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103708\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525007808\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525007808","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Improving deepfake detection with predictive inter-modal alignment and feature reconstruction in audio–visual asynchrony scenarios
Existing multimodal deepfake detection methods primarily rely on capturing correlations between audio–visual modalities to improve detection performance. However, in scenarios such as instant messaging and online video conferencing, network jitter often leads to audio–visual asynchrony, disrupting inter-modal associations and limiting the effectiveness of these methods. To address this issue, we propose a deepfake detection framework specifically designed for audio–visual asynchrony scenarios. First, based on the theory of open balls in metric space, we analyze the variation mechanism of joint features in both audio–visual synchrony and asynchrony scenarios, revealing the impact of audio–visual asynchrony on detection performance. Second, we design a multimodal subspace representation module that incorporates hierarchical cross-modal semantic similarity to address inconsistencies in audio–visual data distributions and representation heterogeneity. Furthermore, we formulate audio–visual feature alignment as an integer linear programming task and employ the Hungarian algorithm to reconstruct missing inter-modal associations. Finally, we introduce a self-supervised masked reconstruction mechanism to restore missing features and construct a joint correlation matrix to measure cross-modal dependencies, enhancing the robustness of detection. Theoretical analysis and experimental results show that our method outperforms baselines in audio–visual synchrony and asynchrony scenarios and exhibits robustness against unknown disturbances.
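The abstract formulates audio–visual feature alignment as an integer linear programming task solved with the Hungarian algorithm. Below is a minimal sketch, not the authors' implementation, of how such an alignment step could look: per-frame audio and visual embeddings are matched one-to-one by minimizing a pairwise cost. The cosine-distance cost, the function name, and the tensor shapes are illustrative assumptions.

```python
# Sketch of Hungarian-algorithm alignment of asynchronous audio/visual features.
# Assumes per-frame embeddings are already extracted; shapes and cost are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_features(audio_feats: np.ndarray, visual_feats: np.ndarray):
    """Match each audio frame embedding to a visual frame embedding.

    audio_feats:  (T_a, D) array of per-frame audio embeddings.
    visual_feats: (T_v, D) array of per-frame visual embeddings.
    Returns index pairs (audio_idx, visual_idx) minimizing the total matching cost.
    """
    # Cosine distance as the assignment cost between every audio/visual frame pair.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ v.T  # shape (T_a, T_v)

    # linear_sum_assignment implements the Hungarian algorithm and returns
    # the optimal one-to-one matching for this assignment problem.
    audio_idx, visual_idx = linear_sum_assignment(cost)
    return audio_idx, visual_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(20, 128))
    # Simulate asynchrony: the visual stream is a temporally shifted, noisy copy.
    visual = np.roll(audio, shift=3, axis=0) + 0.01 * rng.normal(size=(20, 128))
    a_idx, v_idx = align_features(audio, visual)
    print(a_idx[:5], v_idx[:5])  # matched frame indices recovering the shift
```

In the paper's pipeline the recovered matching would serve to reconstruct the inter-modal associations broken by network jitter before the masked-reconstruction and correlation-matrix stages described in the abstract.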
Journal description:
Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion, and it fosters collaboration among the diverse disciplines driving progress in the field. It is the leading outlet for research and development in this area, with a focus on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating applications to real-world problems, are welcome.