NPVForensics: Learning VA correlations in non-critical phoneme–viseme regions for deepfake detection

IF 4.2 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Yu Chen, Yang Yu, Rongrong Ni, Haoliang Li, Wei Wang, Yao Zhao
{"title":"NPVForensics: Learning VA correlations in non-critical phoneme–viseme regions for deepfake detection","authors":"Yu Chen ,&nbsp;Yang Yu ,&nbsp;Rongrong Ni ,&nbsp;Haoliang Li ,&nbsp;Wei Wang ,&nbsp;Yao Zhao","doi":"10.1016/j.imavis.2025.105461","DOIUrl":null,"url":null,"abstract":"<div><div>Advanced deepfake technology enables the manipulation of visual and audio signals within videos, leading to visual–audio (VA) inconsistencies. Current multimodal detectors primarily rely on VA contrastive learning to identify such inconsistencies, particularly in critical phoneme–viseme (PV) regions. However, state-of-the-art deepfake techniques have aligned critical PV pairs, thereby reducing the inconsistency traces on which existing methods rely. Due to technical constraints, forgers cannot fully synchronize VA in non-critical phoneme–viseme (NPV) regions. Consequently, we exploit inconsistencies in NPV regions as a general cue for deepfake detection. We propose NPVForensics, a two-stage VA correlation learning framework specifically designed to detect VA inconsistencies in NPV regions of deepfake videos. Firstly, to better extract VA unimodal features, we utilize the Swin Transformer to capture long-term global dependencies. Additionally, the Local Feature Aggregation (LFA) module aggregates local features from spatial and channel dimensions, thus preserving more comprehensive and subtle information. Secondly, the VA Correlation Learning (VACL) module enhances intra-modal augmentation and inter-modal information interaction, exploring intrinsic correlations between the two modalities. Moreover, Representation Alignment is introduced for real videos to narrow the modal gap and effectively extract VA correlations. Finally, our model is pre-trained on real videos using a self-supervised strategy and fine-tuned for the deepfake detection task. We conducted extensive experiments on six widely used deepfake datasets: FaceForensics++, FakeAVCeleb, Celeb-DF-v2, DFDC, FaceShifter, and DeeperForensics-1.0. Our method achieves state-of-the-art performance in cross-manipulation generalization and robustness. Notably, our approach demonstrates superior performance on VA-coordinated datasets such as A2V, T2V-L, and T2V-S. It indicates that VA inconsistencies in NPV regions serve as a general cue for deepfake detection.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"156 ","pages":"Article 105461"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000496","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Advanced deepfake technology enables the manipulation of visual and audio signals within videos, leading to visual–audio (VA) inconsistencies. Current multimodal detectors rely primarily on VA contrastive learning to identify such inconsistencies, particularly in critical phoneme–viseme (PV) regions. However, state-of-the-art deepfake techniques now align critical PV pairs, reducing the inconsistency traces on which existing methods depend. Due to technical constraints, forgers cannot fully synchronize VA in non-critical phoneme–viseme (NPV) regions. We therefore exploit inconsistencies in NPV regions as a general cue for deepfake detection. We propose NPVForensics, a two-stage VA correlation learning framework designed to detect VA inconsistencies in the NPV regions of deepfake videos. First, to better extract unimodal VA features, we use the Swin Transformer to capture long-term global dependencies, and a Local Feature Aggregation (LFA) module aggregates local features along the spatial and channel dimensions, preserving more comprehensive and subtle information. Second, the VA Correlation Learning (VACL) module combines intra-modal enhancement with inter-modal information interaction, exploring the intrinsic correlations between the two modalities. Moreover, Representation Alignment is introduced on real videos to narrow the modality gap and extract VA correlations effectively. Finally, our model is pre-trained on real videos using a self-supervised strategy and fine-tuned for the deepfake detection task. Extensive experiments on six widely used deepfake datasets (FaceForensics++, FakeAVCeleb, Celeb-DF-v2, DFDC, FaceShifter, and DeeperForensics-1.0) show that our method achieves state-of-the-art cross-manipulation generalization and robustness. Notably, our approach performs especially well on VA-coordinated datasets such as A2V, T2V-L, and T2V-S. This indicates that VA inconsistencies in NPV regions serve as a general cue for deepfake detection.
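To make the second-stage ideas concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract names: inter-modal information interaction (as attributed to the VACL module) and representation alignment on real videos. All class names, feature dimensions, and the specific loss here are illustrative assumptions, not the authors' implementation; the paper's actual encoders are Swin Transformer based, which this sketch replaces with dummy features.

```python
# Minimal sketch, assuming cross-attention for inter-modal interaction
# and a cosine-distance alignment loss. Hypothetical names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VACorrelation(nn.Module):
    """Toy stand-in for a VACL-style module: each modality attends
    to the other, so correlated VA structure can be exchanged."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vis_from_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aud_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, aud):
        # vis, aud: (batch, time, dim) unimodal feature sequences
        vis_out, _ = self.vis_from_aud(vis, aud, aud)  # visual queries attend to audio
        aud_out, _ = self.aud_from_vis(aud, vis, vis)  # audio queries attend to visual
        return vis_out, aud_out

def alignment_loss(vis, aud):
    """Representation-alignment term for real videos: pull the pooled
    embeddings of the two modalities together (cosine distance)."""
    v = F.normalize(vis.mean(dim=1), dim=-1)
    a = F.normalize(aud.mean(dim=1), dim=-1)
    return (1 - (v * a).sum(dim=-1)).mean()

# Usage on dummy features (the real system would feed Swin-based encodings):
vis = torch.randn(2, 16, 256)  # e.g., 16 time steps of visual features
aud = torch.randn(2, 16, 256)  # time-aligned audio features
vacl = VACorrelation()
v_out, a_out = vacl(vis, aud)
loss = alignment_loss(v_out, a_out)  # minimized during self-supervised pre-training
```

In this reading, pre-training on real videos teaches the model what well-correlated VA pairs look like, so at fine-tuning time the residual misalignment in NPV regions of forged videos stands out as an anomaly.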
Source journal
Image and Vision Computing
Category: Engineering & Technology - Electronic & Electrical Engineering
CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.