{"title":"基于多模态的多层次语义线索提取人脸伪造检测","authors":"Lingyun Yu;Tian Xie;Chuanbin Liu;Guoqing Jin;Zhiguo Ding;Hongtao Xie","doi":"10.1109/TCSVT.2024.3524602","DOIUrl":null,"url":null,"abstract":"Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundary, flickering) in spatial-temporal domains or high-level semantic inconsistencies (e.g., abnormal lip movements) between visual-auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they only consider single semantic mode inconsistencies, but ignore the complementarity of forgery traces at different levels and different modalities. In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts the teacher-student protocol to focus on both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, our framework primarily comprises the Spatial-Temporal Pattern Learning module and the Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network’s ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to enhance the proximity of real faces in feature space without explicitly penalizing the distance from manipulated faces, which prevents the overfitting in particular manipulation methods and improves the generalization capability. Extensive experiments show that our method substantially improves the generalization and robustness performance. Particularly, our approach outperforms the SOTA detector by 1.4% in generalization performance on DFDC with large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at <uri>https://github.com/TianXie834/M2SD</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4698-4712"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distilling Multi-Level Semantic Cues Across Multi-Modalities for Face Forgery Detection\",\"authors\":\"Lingyun Yu;Tian Xie;Chuanbin Liu;Guoqing Jin;Zhiguo Ding;Hongtao Xie\",\"doi\":\"10.1109/TCSVT.2024.3524602\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundary, flickering) in spatial-temporal domains or high-level semantic inconsistencies (e.g., abnormal lip movements) between visual-auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they only consider single semantic mode inconsistencies, but ignore the complementarity of forgery traces at different levels and different modalities. 
In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts the teacher-student protocol to focus on both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, our framework primarily comprises the Spatial-Temporal Pattern Learning module and the Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network’s ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to enhance the proximity of real faces in feature space without explicitly penalizing the distance from manipulated faces, which prevents the overfitting in particular manipulation methods and improves the generalization capability. Extensive experiments show that our method substantially improves the generalization and robustness performance. Particularly, our approach outperforms the SOTA detector by 1.4% in generalization performance on DFDC with large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at <uri>https://github.com/TianXie834/M2SD</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4698-4712\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10819430/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10819430/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Distilling Multi-Level Semantic Cues Across Multi-Modalities for Face Forgery Detection
Existing face forgery detection methods attempt to identify low-level forgery artifacts (e.g., blending boundaries, flickering) in the spatial-temporal domain or high-level semantic inconsistencies (e.g., abnormal lip movements) between the visual and auditory modalities for generalized face forgery detection. However, they still suffer from significant performance degradation when dealing with out-of-domain artifacts, as they consider inconsistencies of only a single semantic mode and ignore the complementarity of forgery traces across different levels and modalities. In this paper, we propose a novel Multi-modal Multi-level Semantic Cues Distillation Detection framework that adopts a teacher-student protocol and attends to both spatial-temporal artifacts and visual-auditory incoherence to capture multi-level semantic cues. Specifically, the framework comprises a Spatial-Temporal Pattern Learning module and a Visual-Auditory Consistency Modeling module. The Spatial-Temporal Pattern Learning module employs a mask-reconstruction strategy, in which the student network learns diverse spatial-temporal patterns from a pixel-wise teacher network to capture low-level forgery artifacts. The Visual-Auditory Consistency Modeling module is designed to enhance the student network's ability to identify high-level semantic irregularities, with a visual-auditory consistency modeling expert serving as a guide. Furthermore, a novel Real-Similarity loss is proposed to pull real faces closer together in feature space without explicitly penalizing their distance from manipulated faces, which prevents overfitting to particular manipulation methods and improves generalization capability. Extensive experiments show that our method substantially improves generalization and robustness. In particular, our approach outperforms the state-of-the-art detector by 1.4% in generalization performance on DFDC, which has large domain gaps, and by 2.0% in the robustness evaluation on the FF++ dataset under various extreme settings. Our code is available at https://github.com/TianXie834/M2SD.
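The Real-Similarity loss is only described at a high level in the abstract, and the authors' actual formulation lives in the linked repository. As a hedged illustration of the stated idea (tighten real faces in feature space while placing no explicit penalty on their distance from manipulated faces), a minimal PyTorch sketch might look like the following; the function name, the cosine-similarity choice, and the batch handling are assumptions rather than the paper's implementation.

import torch
import torch.nn.functional as F

def real_similarity_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a Real-Similarity-style loss.

    Pulls embeddings of real faces toward each other and deliberately ignores
    manipulated samples, so their distance from real faces is never penalized.

    features: (B, D) face embeddings, e.g., from the student network.
    labels:   (B,) binary labels, 1 = real, 0 = manipulated.
    """
    real = F.normalize(features[labels == 1], dim=-1)  # L2-normalize real embeddings only
    n = real.size(0)
    if n < 2:  # need at least two real samples in the batch to form a pair
        return features.new_zeros(())
    sim = real @ real.t()  # (n, n) pairwise cosine similarities among real faces
    off_diag = sim[~torch.eye(n, dtype=torch.bool, device=sim.device)]
    return (1.0 - off_diag).mean()  # small when real faces cluster tightly

Because manipulated samples never enter the pairwise term, no specific forgery type is explicitly pushed away, which is the property the abstract credits with avoiding overfitting to particular manipulation methods.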
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.