{"title":"基于多模态的多层次语义线索提取人脸伪造检测","authors":"Lingyun Yu;Tian Xie;Chuanbin Liu;Guoqing Jin;Zhiguo Ding;Hongtao Xie","doi":"10.1109/TCSVT.2024.3524602","DOIUrl":null,"url":null,"abstract":"Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundary, flickering) in spatial-temporal domains or high-level semantic inconsistencies (e.g., abnormal lip movements) between visual-auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they only consider single semantic mode inconsistencies, but ignore the complementarity of forgery traces at different levels and different modalities. In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts the teacher-student protocol to focus on both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, our framework primarily comprises the Spatial-Temporal Pattern Learning module and the Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network’s ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to enhance the proximity of real faces in feature space without explicitly penalizing the distance from manipulated faces, which prevents the overfitting in particular manipulation methods and improves the generalization capability. Extensive experiments show that our method substantially improves the generalization and robustness performance. Particularly, our approach outperforms the SOTA detector by 1.4% in generalization performance on DFDC with large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at <uri>https://github.com/TianXie834/M2SD</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4698-4712"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distilling Multi-Level Semantic Cues Across Multi-Modalities for Face Forgery Detection\",\"authors\":\"Lingyun Yu;Tian Xie;Chuanbin Liu;Guoqing Jin;Zhiguo Ding;Hongtao Xie\",\"doi\":\"10.1109/TCSVT.2024.3524602\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundary, flickering) in spatial-temporal domains or high-level semantic inconsistencies (e.g., abnormal lip movements) between visual-auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they only consider single semantic mode inconsistencies, but ignore the complementarity of forgery traces at different levels and different modalities. 
In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts the teacher-student protocol to focus on both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, our framework primarily comprises the Spatial-Temporal Pattern Learning module and the Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network’s ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to enhance the proximity of real faces in feature space without explicitly penalizing the distance from manipulated faces, which prevents the overfitting in particular manipulation methods and improves the generalization capability. Extensive experiments show that our method substantially improves the generalization and robustness performance. Particularly, our approach outperforms the SOTA detector by 1.4% in generalization performance on DFDC with large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at <uri>https://github.com/TianXie834/M2SD</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4698-4712\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10819430/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10819430/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Distilling Multi-Level Semantic Cues Across Multi-Modalities for Face Forgery Detection
Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundaries, flickering) in the spatial-temporal domain or high-level semantic inconsistencies (e.g., abnormal lip movements) between the visual and auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they consider inconsistencies of only a single semantic mode and ignore the complementarity of forgery traces across different levels and modalities. In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts a teacher-student protocol and attends to both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, the framework comprises a Spatial-Temporal Pattern Learning module and a Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network's ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to pull real faces closer together in feature space without explicitly penalizing their distance from manipulated faces, which prevents overfitting to particular manipulation methods and improves generalization capability. Extensive experiments show that our method substantially improves generalization and robustness. In particular, our approach outperforms the state-of-the-art detector by 1.4% in generalization performance on DFDC, which has large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at https://github.com/TianXie834/M2SD.
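The Real-Similarity loss is only described at a high level in the abstract, and the authors' actual formulation lives in the linked repository. As a hedged illustration of the stated idea (tighten real faces in feature space while placing no explicit penalty on their distance from manipulated faces), a minimal PyTorch sketch might look like the following; the function name, the cosine-similarity choice, and the batch handling are assumptions rather than the paper's implementation.

import torch
import torch.nn.functional as F

def real_similarity_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a Real-Similarity-style loss.

    Pulls embeddings of real faces toward each other and deliberately ignores
    manipulated samples, so their distance from real faces is never penalized.

    features: (B, D) face embeddings, e.g., from the student network.
    labels:   (B,) binary labels, 1 = real, 0 = manipulated.
    """
    real = F.normalize(features[labels == 1], dim=-1)  # L2-normalize real embeddings only
    n = real.size(0)
    if n < 2:  # need at least two real samples in the batch to form a pair
        return features.new_zeros(())
    sim = real @ real.t()  # (n, n) pairwise cosine similarities among real faces
    off_diag = sim[~torch.eye(n, dtype=torch.bool, device=sim.device)]
    return (1.0 - off_diag).mean()  # small when real faces cluster tightly

Because manipulated samples never enter the pairwise term, no specific forgery type is explicitly pushed away, which is the property the abstract credits with avoiding overfitting to particular manipulation methods.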
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.