{"title":"Spatiotemporal Inconsistency Learning and Interactive Fusion for Deepfake Video Detection","authors":"Dengyong Zhang, Wenjie Zhu, Xin Liao, Feifan Qi, Gaobo Yang, Xiangling Ding","doi":"10.1145/3664654","DOIUrl":null,"url":null,"abstract":"<p>With the rise of the metaverse, the rapid advancement of Deepfakes technology has become closely intertwined. Within the metaverse, individuals exist in digital form and engage in interactions, transactions, and communications through virtual avatars. However, the development of Deepfakes technology has led to the proliferation of forged information disseminated under the guise of users’ virtual identities, posing significant security risks to the metaverse. Hence, there is an urgent need to research and develop more robust methods for detecting deep forgeries to address these challenges. This paper explores deepfake video detection by leveraging the spatiotemporal inconsistencies generated by deepfake generation techniques, and thereby proposing the interactive spatioTemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of phase-aware and sequence streams. The spatial inconsistencies exhibited in frames of deepfake videos are primarily attributed to variations in the structural information contained within the phase component of the Fourier domain. To mitigate the issue of overfitting the content information, a phase-aware stream is introduced to learn the spatial inconsistencies from the phase-based reconstructed frames. Additionally, considering that deepfake videos are generated frame-by-frame and lack temporal consistency between frames, a sequence stream is proposed to extract temporal inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, through feature interaction and fusion of the two streams, the representation ability of intermediate and classification features is further enhanced. The proposed method, which was evaluated on four mainstream datasets, outperformed most existing methods, and extensive experimental results demonstrated its effectiveness in identifying deepfake videos. Our source code is available at https://github.com/qff98/Deepfake-Video-Detection</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"156 1","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3664654","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citation count: 0
Abstract
The rise of the metaverse has become closely intertwined with the rapid advancement of Deepfake technology. Within the metaverse, individuals exist in digital form and engage in interactions, transactions, and communications through virtual avatars. However, the development of Deepfake technology has led to the proliferation of forged information disseminated under the guise of users' virtual identities, posing significant security risks to the metaverse. Hence, there is an urgent need to research and develop more robust methods for detecting deep forgeries. This paper explores deepfake video detection by leveraging the spatiotemporal inconsistencies introduced by deepfake generation techniques and proposes the spatiotemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of a phase-aware stream and a sequence stream. The spatial inconsistencies exhibited in frames of deepfake videos are primarily attributed to variations in the structural information contained within the phase component of the Fourier domain. To mitigate overfitting to content information, the phase-aware stream learns spatial inconsistencies from phase-based reconstructed frames. Additionally, considering that deepfake videos are generated frame by frame and lack temporal consistency between frames, the sequence stream extracts temporal inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, through feature interaction and fusion of the two streams, the representation ability of the intermediate and classification features is further enhanced. The proposed method was evaluated on four mainstream datasets, where it outperformed most existing methods, and extensive experimental results demonstrated its effectiveness in identifying deepfake videos. Our source code is available at https://github.com/qff98/Deepfake-Video-Detection
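To make the two cues described in the abstract concrete, the sketch below illustrates (i) a phase-only reconstruction of a frame, which keeps the structural information carried by the Fourier phase while discarding the amplitude (and thus most content/texture cues), and (ii) simple differences between consecutive frames as a spatiotemporal difference signal. This is a minimal NumPy sketch under our own assumptions: the function names, normalization, and preprocessing are illustrative and are not taken from the paper or its released code.

```python
import numpy as np

def phase_only_frame(frame: np.ndarray) -> np.ndarray:
    """Rebuild a frame from the phase of its 2-D Fourier spectrum.

    The amplitude is set to 1 so content/texture information is suppressed
    while the structural information in the phase component is retained.
    (Illustrative only; the paper's exact reconstruction may differ.)
    """
    spectrum = np.fft.fft2(frame, axes=(0, 1))           # per-channel 2-D FFT
    phase_only = np.exp(1j * np.angle(spectrum))         # unit amplitude, original phase
    recon = np.fft.ifft2(phase_only, axes=(0, 1)).real   # back to the spatial domain
    # Min-max normalize so the result can be fed to a CNN like a regular image.
    return (recon - recon.min()) / (recon.max() - recon.min() + 1e-8)

def temporal_differences(clip: np.ndarray) -> np.ndarray:
    """Differences of consecutive frames for a clip of shape (T, H, W, C)."""
    return clip[1:].astype(np.float32) - clip[:-1].astype(np.float32)

# Example with a random 8-frame RGB clip (stand-in for real video frames).
clip = np.random.rand(8, 64, 64, 3).astype(np.float32)
phase_frames = np.stack([phase_only_frame(f) for f in clip])  # phase-aware stream input
diff_frames = temporal_differences(clip)                      # sequence stream input
```

In this reading, the phase-aware stream would consume `phase_frames` and the sequence stream `diff_frames`, with the two feature sets later interacting and being fused for classification, as the abstract describes.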
Journal description:
The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group on Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome.
TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles per issue. In addition, all Special Issues are published online-only to ensure timely publication. The transactions consist primarily of research papers. This is an archival journal, and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.