Audio-Video Analysis Method of Public Speaking Videos to Detect Deepfake Threat

Robert Wolański, Karol Jędrasiak
{"title":"Audio-Video Analysis Method of Public Speaking Videos to Detect Deepfake Threat","authors":"Robert Wolański, Karol Jędrasiak","doi":"10.12845/sft.62.2.2023.10","DOIUrl":null,"url":null,"abstract":"Aim: The purpose of the article is to present the hypothesis that the use of discrepancies in audiovisual materials can significantly increase the effectiveness of detecting various types of deepfake and related threats. In order to verify this hypothesis, the authors proposed a new method that reveals inconsistencies in both multiple modalities simultaneously and within individual modalities separately, enabling them to effectively distinguish between authentic and altered public speaking videos. Project and methods: The proposed approach is to integrate audio and visual signals in a so-called fine-grained manner, and then carry out binary classification processes based on calculated adjustments to the classification results of each modality. The method has been tested using various network architectures, in particular Capsule networks – for deep anomaly detection and Swin Transformer – for image classification. Pre-processing included frame extraction and face detection using the MTCNN algorithm, as well as conversion of audio to mel spectrograms to better reflect human auditory perception. The proposed technique was tested on multimodal deepfake datasets, namely FakeAVCeleb and TMC, along with a custom dataset containing 4,700 recordings. The method has shown high performance in identifying deepfake threats in various test scenarios. Results: The method proposed by the authors achieved better AUC and accuracy compared to other reference methods, confirming its effectiveness in the analysis of multimodal artefacts. The test results confirm that it is effective in detecting modified videos in a variety of test scenarios which can be considered an advance over existing deepfake detection techniques. The results highlight the adaptability of the method in various architectures of feature extraction networks. Conclusions: The presented method of audiovisual deepfake detection uses fine inconsistencies of multimodal features to distinguish whether the material is authentic or synthetic. It is distinguished by its ability to point out inconsistencies in different types of deepfakes and, within each individual modality, can effectively distinguish authentic content from manipulated counterparts. The adaptability has been confirmed by the successful application of the method in various feature extraction network architectures. Moreover, its effectiveness has been proven in rigorous tests on two different audiovisual deepfake datasets. Keywords: analysis of audio-video stream, detection of deepfake threats, analysis of public speeches","PeriodicalId":113945,"journal":{"name":"Safety & Fire Technology","volume":"107 5","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Safety & Fire Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12845/sft.62.2.2023.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Aim: The purpose of the article is to present the hypothesis that the use of discrepancies in audiovisual materials can significantly increase the effectiveness of detecting various types of deepfakes and related threats. To verify this hypothesis, the authors propose a new method that reveals inconsistencies both across multiple modalities simultaneously and within individual modalities separately, making it possible to distinguish effectively between authentic and altered public speaking videos.

Project and methods: The proposed approach integrates audio and visual signals in a fine-grained manner and then performs binary classification based on calculated adjustments to the classification results of each modality. The method was tested with various network architectures, in particular Capsule networks for deep anomaly detection and the Swin Transformer for image classification. Pre-processing included frame extraction and face detection using the MTCNN algorithm, as well as conversion of audio to mel spectrograms to better reflect human auditory perception. The technique was evaluated on the multimodal deepfake datasets FakeAVCeleb and TMC, along with a custom dataset containing 4,700 recordings, and showed high performance in identifying deepfake threats across a variety of test scenarios.
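The abstract names MTCNN for face detection and mel spectrograms for the audio track; a minimal preprocessing sketch along those lines follows, assuming the facenet-pytorch implementation of MTCNN, OpenCV for frame extraction, and librosa for the spectrograms. The frame-sampling rate, crop size, and mel parameters are illustrative assumptions, not values taken from the paper.

```python
# Illustrative preprocessing sketch: frame extraction, MTCNN face
# detection, and mel-spectrogram conversion. Library choices and all
# parameter values are assumptions; the abstract does not specify them.
import cv2
import librosa
import numpy as np
from PIL import Image
from facenet_pytorch import MTCNN  # one common MTCNN implementation

mtcnn = MTCNN(image_size=224, margin=20)  # crop size is an assumption

def extract_face_crops(video_path, every_nth=5):
    """Sample every n-th frame and return MTCNN face crops as tensors."""
    crops, cap, idx = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            face = mtcnn(rgb)  # returns None if no face is detected
            if face is not None:
                crops.append(face)
        idx += 1
    cap.release()
    return crops

def audio_to_mel(audio_path, sr=16000, n_mels=80):
    """Convert an audio track to a log-scaled mel spectrogram."""
    y, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # log scale approximates auditory perception
```

The face crops would feed the visual branch (e.g. a Swin Transformer) and the mel spectrograms the audio branch; how the two branches are wired together is sketched after the abstract.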
Results: The proposed method achieved better AUC and accuracy than the reference methods, confirming its effectiveness in the analysis of multimodal artefacts. The test results confirm that it detects modified videos effectively in a variety of test scenarios, which can be considered an advance over existing deepfake detection techniques, and they highlight the adaptability of the method across different feature extraction network architectures.

Conclusions: The presented method of audiovisual deepfake detection uses fine inconsistencies of multimodal features to distinguish authentic material from synthetic material. It is distinguished by its ability to point out inconsistencies in different types of deepfakes and, within each individual modality, to separate authentic content from manipulated counterparts. Its adaptability has been confirmed by successful application across several feature extraction network architectures, and its effectiveness has been demonstrated in tests on two different audiovisual deepfake datasets.

Keywords: analysis of audio-video stream, detection of deepfake threats, analysis of public speeches
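The methods section describes binary classification "based on calculated adjustments to the classification results of each modality", but the abstract does not give the exact formulation. The sketch below shows one plausible reading, in which disagreement between the audio and video scores shifts a fused score; the function name, the disagreement term, and the weight alpha are hypothetical, not the authors' formula.

```python
# Hypothetical score-level fusion sketch. The paper's exact adjustment
# rule is not given in the abstract; here, cross-modal disagreement
# simply shifts the fused fake-probability upward as one plausible reading.
import torch

def fused_decision(p_audio: torch.Tensor,
                   p_video: torch.Tensor,
                   p_joint: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Combine per-modality fake-probabilities with a joint score.

    p_audio, p_video : per-modality classifier outputs in [0, 1]
    p_joint          : output of the fine-grained audio-visual branch
    alpha            : weight of the inconsistency adjustment (assumption)
    """
    inconsistency = (p_audio - p_video).abs()  # cross-modal disagreement
    score = p_joint + alpha * inconsistency    # adjusted fused score
    return score.clamp(0.0, 1.0)               # 1.0 ~ likely deepfake

# Example: the audio branch flags the clip but the video branch does not;
# the disagreement raises the fused score above the joint estimate alone.
print(fused_decision(torch.tensor(0.9), torch.tensor(0.2), torch.tensor(0.5)))
```

The intuition matches the abstract's thesis: a lip-synced deepfake may look plausible frame by frame yet disagree with its own audio, so cross-modal inconsistency is itself evidence of manipulation.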
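The reported metrics, AUC and accuracy, can be computed with scikit-learn as shown below; the scores, labels, and 0.5 decision threshold are illustrative placeholders, not data from the paper.

```python
# Evaluation sketch for the metrics reported in the Results section
# (AUC and accuracy); the threshold is an illustrative assumption.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(scores: np.ndarray, labels: np.ndarray, threshold: float = 0.5):
    """Return (AUC, accuracy) for fused fake-probability scores."""
    auc = roc_auc_score(labels, scores)
    acc = accuracy_score(labels, (scores >= threshold).astype(int))
    return auc, acc

# Toy example with made-up scores; a real evaluation would use the
# FakeAVCeleb / TMC test splits mentioned in the abstract.
scores = np.array([0.92, 0.15, 0.78, 0.40, 0.88, 0.10])
labels = np.array([1, 0, 1, 0, 1, 0])
print(evaluate(scores, labels))
```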