{"title":"Spatiotemporal Inconsistency Learning and Interactive Fusion for Deepfake Video Detection","authors":"Dengyong Zhang, Wenjie Zhu, Xin Liao, Feifan Qi, Gaobo Yang, Xiangling Ding","doi":"10.1145/3664654","DOIUrl":null,"url":null,"abstract":"<p>With the rise of the metaverse, the rapid advancement of Deepfakes technology has become closely intertwined. Within the metaverse, individuals exist in digital form and engage in interactions, transactions, and communications through virtual avatars. However, the development of Deepfakes technology has led to the proliferation of forged information disseminated under the guise of users’ virtual identities, posing significant security risks to the metaverse. Hence, there is an urgent need to research and develop more robust methods for detecting deep forgeries to address these challenges. This paper explores deepfake video detection by leveraging the spatiotemporal inconsistencies generated by deepfake generation techniques, and thereby proposing the interactive spatioTemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of phase-aware and sequence streams. The spatial inconsistencies exhibited in frames of deepfake videos are primarily attributed to variations in the structural information contained within the phase component of the Fourier domain. To mitigate the issue of overfitting the content information, a phase-aware stream is introduced to learn the spatial inconsistencies from the phase-based reconstructed frames. Additionally, considering that deepfake videos are generated frame-by-frame and lack temporal consistency between frames, a sequence stream is proposed to extract temporal inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, through feature interaction and fusion of the two streams, the representation ability of intermediate and classification features is further enhanced. The proposed method, which was evaluated on four mainstream datasets, outperformed most existing methods, and extensive experimental results demonstrated its effectiveness in identifying deepfake videos. Our source code is available at https://github.com/qff98/Deepfake-Video-Detection</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"156 1","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3664654","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citation count: 0
Abstract
The rise of the metaverse has become closely intertwined with the rapid advancement of Deepfake technology. Within the metaverse, individuals exist in digital form and engage in interactions, transactions, and communications through virtual avatars. However, the development of Deepfake technology has led to the proliferation of forged information disseminated under the guise of users' virtual identities, posing significant security risks to the metaverse. Hence, there is an urgent need to research and develop more robust methods for detecting deep forgeries. This paper explores deepfake video detection by leveraging the spatiotemporal inconsistencies introduced by deepfake generation techniques and proposes the spatiotemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of a phase-aware stream and a sequence stream. The spatial inconsistencies exhibited in frames of deepfake videos are primarily attributed to variations in the structural information contained within the phase component of the Fourier domain. To mitigate overfitting to content information, the phase-aware stream learns spatial inconsistencies from phase-based reconstructed frames. Additionally, considering that deepfake videos are generated frame by frame and lack temporal consistency between frames, the sequence stream extracts temporal inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, through feature interaction and fusion of the two streams, the representation ability of the intermediate and classification features is further enhanced. The proposed method was evaluated on four mainstream datasets, where it outperformed most existing methods, and extensive experimental results demonstrated its effectiveness in identifying deepfake videos. Our source code is available at https://github.com/qff98/Deepfake-Video-Detection
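To make the two cues described in the abstract concrete, the sketch below illustrates (i) a phase-only reconstruction of a frame, which keeps the structural information carried by the Fourier phase while discarding the amplitude (and thus most content/texture cues), and (ii) simple differences between consecutive frames as a spatiotemporal difference signal. This is a minimal NumPy sketch under our own assumptions: the function names, normalization, and preprocessing are illustrative and are not taken from the paper or its released code.

```python
import numpy as np

def phase_only_frame(frame: np.ndarray) -> np.ndarray:
    """Rebuild a frame from the phase of its 2-D Fourier spectrum.

    The amplitude is set to 1 so content/texture information is suppressed
    while the structural information in the phase component is retained.
    (Illustrative only; the paper's exact reconstruction may differ.)
    """
    spectrum = np.fft.fft2(frame, axes=(0, 1))           # per-channel 2-D FFT
    phase_only = np.exp(1j * np.angle(spectrum))         # unit amplitude, original phase
    recon = np.fft.ifft2(phase_only, axes=(0, 1)).real   # back to the spatial domain
    # Min-max normalize so the result can be fed to a CNN like a regular image.
    return (recon - recon.min()) / (recon.max() - recon.min() + 1e-8)

def temporal_differences(clip: np.ndarray) -> np.ndarray:
    """Differences of consecutive frames for a clip of shape (T, H, W, C)."""
    return clip[1:].astype(np.float32) - clip[:-1].astype(np.float32)

# Example with a random 8-frame RGB clip (stand-in for real video frames).
clip = np.random.rand(8, 64, 64, 3).astype(np.float32)
phase_frames = np.stack([phase_only_frame(f) for f in clip])  # phase-aware stream input
diff_frames = temporal_differences(clip)                      # sequence stream input
```

In this reading, the phase-aware stream would consume `phase_frames` and the sequence stream `diff_frames`, with the two feature sets later interacting and being fused for classification, as the abstract describes.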
Journal description:
The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group on Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome.
TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles per issue. In addition, all Special Issues are published online-only to ensure timely publication. The transactions consist primarily of research papers. This is an archival journal, and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.