Yameng Tu, Jianbin Wu, Liang Lu, Shuaikang Gao, MingHao Li
{"title":"基于表情键序列的人脸伪造视频检测","authors":"Yameng Tu, Jianbin Wu, Liang Lu, Shuaikang Gao, MingHao Li","doi":"10.1016/j.jksuci.2024.102142","DOIUrl":null,"url":null,"abstract":"<div><p>In order to minimize additional computational costs in detecting forged videos, and enhance detection accuracy, this paper employs dynamic facial expression sequences as key sequences, replacing original video sequences as inputs for the detection model. A spatio-temporal dual-branch detection network is designed based on the visual Transformer architecture. Specifically, this process involves three steps. Firstly, dynamic facial expression sequences are localized as key sequences using optical flow difference algorithms. Subsequently, the spatial branch network employs the focal self-attention mechanism to focus on dynamic features of expression-relevant regions and uses Factorization Machines to facilitate feature interaction among multiple key sequences. Meanwhile, the temporal branch network concentrates on learning the temporal inconsistency of optical flow differences between adjacent frames. Finally, a binary classification linear SVM combines the Softmax values from the two branch networks to provide the ultimate detection outcome. 
Experimental results on the Faceforensics++ dataset demonstrate: (a) replacing whole video sequences with facial expression key sequences effectively reduces training and detection time by nearly 80% and 90%, respectively; (b) compared to state-of-the-art methods involving random sequence/frame extraction and key frame extraction based on video compression techniques, the proposed approach in this paper presents a more competitive detection accuracy.</p></div>","PeriodicalId":48547,"journal":{"name":"Journal of King Saud University-Computer and Information Sciences","volume":"36 7","pages":"Article 102142"},"PeriodicalIF":5.2000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1319157824002313/pdfft?md5=d3161c3d47c3e55bf622551f8213c551&pid=1-s2.0-S1319157824002313-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Face forgery video detection based on expression key sequences\",\"authors\":\"Yameng Tu, Jianbin Wu, Liang Lu, Shuaikang Gao, MingHao Li\",\"doi\":\"10.1016/j.jksuci.2024.102142\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In order to minimize additional computational costs in detecting forged videos, and enhance detection accuracy, this paper employs dynamic facial expression sequences as key sequences, replacing original video sequences as inputs for the detection model. A spatio-temporal dual-branch detection network is designed based on the visual Transformer architecture. Specifically, this process involves three steps. Firstly, dynamic facial expression sequences are localized as key sequences using optical flow difference algorithms. Subsequently, the spatial branch network employs the focal self-attention mechanism to focus on dynamic features of expression-relevant regions and uses Factorization Machines to facilitate feature interaction among multiple key sequences. 
Meanwhile, the temporal branch network concentrates on learning the temporal inconsistency of optical flow differences between adjacent frames. Finally, a binary classification linear SVM combines the Softmax values from the two branch networks to provide the ultimate detection outcome. Experimental results on the Faceforensics++ dataset demonstrate: (a) replacing whole video sequences with facial expression key sequences effectively reduces training and detection time by nearly 80% and 90%, respectively; (b) compared to state-of-the-art methods involving random sequence/frame extraction and key frame extraction based on video compression techniques, the proposed approach in this paper presents a more competitive detection accuracy.</p></div>\",\"PeriodicalId\":48547,\"journal\":{\"name\":\"Journal of King Saud University-Computer and Information Sciences\",\"volume\":\"36 7\",\"pages\":\"Article 102142\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2024-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S1319157824002313/pdfft?md5=d3161c3d47c3e55bf622551f8213c551&pid=1-s2.0-S1319157824002313-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of King Saud University-Computer and Information Sciences\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1319157824002313\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of King Saud University-Computer and Information 
Sciences","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1319157824002313","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Face forgery video detection based on expression key sequences
To minimize the additional computational cost of detecting forged videos and to enhance detection accuracy, this paper employs dynamic facial expression sequences as key sequences, replacing the original video sequences as input to the detection model. A spatio-temporal dual-branch detection network is designed based on the vision Transformer architecture. The process involves three steps. First, dynamic facial expression sequences are localized as key sequences using an optical-flow difference algorithm. Next, the spatial branch network employs a focal self-attention mechanism to focus on the dynamic features of expression-relevant regions and uses Factorization Machines to facilitate feature interaction among multiple key sequences, while the temporal branch network concentrates on learning the temporal inconsistency of optical-flow differences between adjacent frames. Finally, a binary linear SVM classifier combines the Softmax values from the two branch networks to produce the final detection result. Experimental results on the FaceForensics++ dataset demonstrate that (a) replacing whole video sequences with facial-expression key sequences reduces training and detection time by nearly 80% and 90%, respectively, and (b) compared with state-of-the-art methods based on random sequence/frame extraction and key-frame extraction using video-compression techniques, the proposed approach achieves more competitive detection accuracy.
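The first step of the pipeline described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it scores adjacent frames by the magnitude of their optical-flow differences (here using OpenCV's Farneback dense flow as a stand-in for the paper's unspecified algorithm) and keeps runs of frames whose score stays above a threshold as candidate "key sequences". All function names, the threshold, and the minimum run length are illustrative assumptions.

```python
def flow_difference_magnitudes(frames):
    """Mean absolute difference of dense optical flow between adjacent frames.

    `frames` is a list of grayscale images (2-D uint8 NumPy arrays).
    cv2 and numpy are imported lazily so the pure-Python localization
    step below can be used without them.
    """
    import cv2
    import numpy as np
    # Dense Farneback flow for each adjacent frame pair.
    flows = [
        cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        for a, b in zip(frames[:-1], frames[1:])
    ]
    # Difference of consecutive flow fields: a large value suggests a change
    # in facial motion, i.e. the onset or offset of an expression.
    return [float(np.abs(f2 - f1).mean()) for f1, f2 in zip(flows[:-1], flows[1:])]


def locate_key_sequences(diff_mags, threshold, min_len=3):
    """Return (start, end) index ranges where the flow-difference magnitude
    stays above `threshold` for at least `min_len` consecutive frames."""
    ranges, start = [], None
    for i, m in enumerate(diff_mags):
        if m > threshold and start is None:
            start = i  # run begins
        elif m <= threshold and start is not None:
            if i - start >= min_len:
                ranges.append((start, i))  # run long enough: keep it
            start = None
    # Close a run that extends to the end of the video.
    if start is not None and len(diff_mags) - start >= min_len:
        ranges.append((start, len(diff_mags)))
    return ranges
```

A fused decision along the lines of the paper's final step would then run the spatial and temporal branches only on the frames inside these ranges and pass their Softmax scores to a linear SVM, which is where the reported training- and detection-time savings come from.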
Journal introduction:
In 2022 the Journal of King Saud University - Computer and Information Sciences will become an author-paid open-access journal. Authors who submit their manuscript after October 31st, 2021 will be asked to pay an Article Processing Charge (APC) after acceptance of their paper to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University - Computer and Information Sciences is a refereed, international journal that covers all aspects of both the foundations of computing and its practical applications.