Lightweight and hybrid transformer-based solution for quick and reliable deepfake detection.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS
Frontiers in Big Data Pub Date : 2025-04-01 eCollection Date: 2025-01-01 DOI:10.3389/fdata.2025.1521653
Geeta Rani, Atharv Kothekar, Shawn George Philip, Vijaypal Singh Dhaka, Ester Zumpano, Eugenio Vocaturo

Abstract

Introduction: Rapid advancements in artificial intelligence and generative artificial intelligence have enabled the creation of fake images and videos that appear highly realistic. According to a report published in 2022, approximately 71% of people are deceived by fake videos and become victims of blackmail. Moreover, such fake videos and images are used to tarnish the reputations of public figures. This has increased the demand for deepfake detection techniques. The accuracy of the techniques proposed in the literature so far varies as fake-content generation techniques change, and these techniques are computationally intensive. The detection approaches discussed in the literature are based on convolutional neural networks, Linformer models, or transformer models, each with its own advantages and disadvantages.

Methods: This manuscript proposes a hybrid architecture combining transformer and Linformer models for deepfake detection. The architecture splits an input image into patches and applies position encoding to retain the spatial relationships between patches. Its encoder captures contextual information from the input patches, and the Gaussian Error Linear Unit (GELU) activation mitigates the vanishing gradient problem.
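The patching, position-encoding, and GELU steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the sinusoidal encoding scheme and the grayscale list-of-lists image format are assumptions for clarity.

```python
import math

def gelu(x):
    """Gaussian Error Linear Unit: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def image_to_patches(image, patch_size):
    """Split an H x W grid (list of rows) into non-overlapping
    patch_size x patch_size patches, each flattened to a vector."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h - patch_size + 1, patch_size):
        for c in range(0, w - patch_size + 1, patch_size):
            patch = [image[r + i][c + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

def sinusoidal_position_encoding(num_positions, dim):
    """Fixed sine/cosine position encodings, added element-wise to the
    patch vectors so the encoder can recover each patch's location."""
    enc = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(row)
    return enc

# A 12 x 12 image with patch size 6 yields 4 patches of 36 values each.
img = [[float(r * 12 + c) for c in range(12)] for r in range(12)]
patches = image_to_patches(img, 6)
pos_enc = sinusoidal_position_encoding(len(patches), len(patches[0]))
encoded = [[v + e for v, e in zip(p, pe)] for p, pe in zip(patches, pos_enc)]
```

The encoded patch vectors would then be fed to the transformer encoder, where GELU serves as the feed-forward activation.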

Results: The Linformer component reduces the size of the attention matrix, halving execution time without compromising accuracy. The hybrid design draws on the complementary strengths of the transformer and Linformer models to improve the robustness and generalization of deepfake detection. Its low computational requirements and high accuracy of 98.9% make the model practical for real-time use, helping prevent blackmail and other harms to the public.
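The attention-matrix reduction works by projecting the sequence of n keys down to a fixed k rows before computing scores, so the score matrix shrinks from n x n to n x k. A pure-Python sketch of the idea (single head, no value projection, toy dimensions chosen for illustration):

```python
import math

def matmul(A, B):
    """Naive matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax_rows(M):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def attention_scores(Q, K):
    """Standard scaled dot-product scores: an n x n matrix for n queries/keys."""
    d = len(Q[0])
    Kt = [list(col) for col in zip(*K)]  # transpose K
    S = matmul(Q, Kt)
    return softmax_rows([[v / math.sqrt(d) for v in row] for row in S])

def linformer_scores(Q, K, E):
    """Linformer-style scores: E (k x n) projects the n keys down to k,
    so the score matrix is only n x k instead of n x n."""
    K_proj = matmul(E, K)               # (k, d)
    return attention_scores(Q, K_proj)  # (n, k)

n, d, k = 8, 4, 2
Q = [[(i + j) % 3 * 0.5 for j in range(d)] for i in range(n)]
K = [[(i * j) % 4 * 0.25 for j in range(d)] for i in range(n)]
E = [[1.0 / n] * n for _ in range(k)]   # toy projection: uniform averaging
standard = attention_scores(Q, K)       # shape (8, 8)
reduced = linformer_scores(Q, K, E)     # shape (8, 2)
```

For long patch sequences this drops the attention cost from quadratic to linear in n, which is the source of the reported halving of execution time.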

Discussion: The proposed hybrid model exploits the transformer's strength in capturing complex patterns in data while using the Linformer's efficient self-attention to reduce computation time without compromising accuracy. The models were evaluated with patch sizes of 6 and 11. The results show that increasing the patch size improves model performance: larger patches allow the model to capture fine-grained features and learn more effectively from the same set of videos, and they better preserve spatial details, which contributes to improved feature extraction.
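One practical consequence of the patch-size choice is the sequence length the encoder must attend over. The abstract does not state the input resolution, so the 66 x 66 size below is hypothetical (chosen because it divides evenly by both reported patch sizes, 6 and 11):

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches covering a square image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    return per_side * per_side

# Hypothetical 66 x 66 input: patch size 6 gives an 11 x 11 grid (121 tokens),
# patch size 11 gives a 6 x 6 grid (36 tokens).
seq_small_patch = num_patches(66, 6)   # 121
seq_large_patch = num_patches(66, 11)  # 36
```

The larger patch size thus also shortens the token sequence, further cutting attention cost alongside the Linformer projection.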
