{"title":"Tex-Net: texture-based parallel branch cross-attention generalized robust Deepfake detector","authors":"Deepak Dagar, Dinesh Kumar Vishwakarma","doi":"10.1007/s00530-024-01424-7","DOIUrl":null,"url":null,"abstract":"<p>In recent years, artificial faces generated using Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs) have become more lifelike and difficult for humans to distinguish. Deepfake refers to highly realistic and impressive media generated using deep learning technology. Convolutional Neural Networks (CNNs) have demonstrated significant potential in computer vision applications, particularly identifying fraudulent faces. However, if these networks are trained on insufficient data, they cannot effectively apply their knowledge to unfamiliar datasets, as they are susceptible to inherent biases in their learning process, such as translation, equivariance, and localization. The attention mechanism of vision transformers has effectively resolved these limits, leading to their growing popularity in recent years. This work introduces a novel module for extracting global texture information and a model that combines data from CNN (ResNet-18) and cross-attention vision transformers. The model takes in input and generates the global texture by utilizing Gram matrices and local binary patterns at each down sampling step of the ResNet-18 architecture. The ResNet-18 main branch and global texture module operate simultaneously before inputting into the visual transformer’s dual branch’s cross-attention mechanism. Initially, the empirical investigation demonstrates that counterfeit images typically display more uniform textures that are inconsistent across long distances. The model’s performance on the cross-forgery dataset is demonstrated by experiments conducted on various types of GAN images and Faceforensics + + categories. The results show that the model outperforms the scores of many state-of-the-art techniques, achieving an accuracy score of up to 85%. Furthermore, multiple tests are performed on different data samples (FF + +, DFDCPreview, Celeb-Df) that undergo post-processing techniques, including compression, noise addition, and blurring. These studies validate that the model acquires the shared distinguishing characteristics (global texture) that persist across different types of fake picture distributions, and the outcomes of these trials demonstrate that the model is resilient and can be used in many scenarios.</p>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01424-7","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
In recent years, artificial faces generated by Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs) have become increasingly lifelike and difficult for humans to distinguish from real ones. Deepfake refers to such highly realistic media generated with deep learning technology. Convolutional Neural Networks (CNNs) have demonstrated significant potential in computer vision applications, particularly in identifying fraudulent faces. However, when trained on insufficient data, these networks generalize poorly to unfamiliar datasets, because their inductive biases, such as translation equivariance and locality, constrain what they learn. The attention mechanism of vision transformers mitigates these limitations, which has driven their growing popularity in recent years. This work introduces a novel module for extracting global texture information and a model that fuses features from a CNN (ResNet-18) with a cross-attention vision transformer. The model generates a global texture representation by applying Gram matrices and local binary patterns at each down-sampling stage of the ResNet-18 architecture. The ResNet-18 main branch and the global texture module run in parallel before feeding into the cross-attention mechanism of the vision transformer's dual branches. An initial empirical investigation shows that counterfeit images typically exhibit more uniform textures whose statistics are inconsistent over long ranges. The model's cross-forgery performance is demonstrated by experiments on various types of GAN-generated images and FaceForensics++ categories; it outperforms many state-of-the-art techniques, achieving an accuracy of up to 85%. Furthermore, multiple tests are performed on data samples (FF++, DFDCPreview, Celeb-DF) subjected to post-processing techniques including compression, noise addition, and blurring. These studies confirm that the model learns shared distinguishing characteristics (global texture) that persist across different fake-image distributions, and the outcomes of these trials demonstrate that the model is robust and usable in many scenarios.
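
Two quantities drive the global-texture module the abstract describes: Gram matrices, which summarize long-range channel correlations of a feature map, and local binary patterns (LBP), which encode micro-texture. The PyTorch sketch below illustrates both operations only; it is not the authors' code, and the tensor shapes, the 8-neighbour LBP variant, and the single tapped stage are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-wise Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)


def lbp_map(gray: torch.Tensor) -> torch.Tensor:
    """8-neighbour local binary pattern codes for a (B, 1, H, W) map."""
    patches = F.unfold(gray, kernel_size=3, padding=1)         # (B, 9, H*W)
    center = patches[:, 4:5, :]                                # middle pixel
    neighbours = torch.cat([patches[:, :4, :], patches[:, 5:, :]], dim=1)
    bits = (neighbours >= center).float()                      # (B, 8, H*W)
    weights = torch.pow(2.0, torch.arange(8, device=gray.device,
                                          dtype=gray.dtype)).view(1, 8, 1)
    codes = (bits * weights).sum(dim=1, keepdim=True)          # (B, 1, H*W)
    return codes.view_as(gray)                                 # codes in [0, 255]


# Texture summaries at one stage; in the full model these would be computed
# at every ResNet-18 down-sampling step and aggregated into texture tokens.
feat = torch.randn(2, 64, 56, 56)              # stand-in for a layer1 output
gram = gram_matrix(feat)                        # (2, 64, 64) long-range statistics
lbp = lbp_map(feat.mean(dim=1, keepdim=True))   # (2, 1, 56, 56) micro-patterns
print(gram.shape, lbp.shape)
```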
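The fusion step, in which the two parallel branches feed a dual-branch cross-attention, resembles a CrossViT-style exchange where each branch's class (CLS) token attends over the other branch's patch tokens. The following is a hedged sketch under that assumption; the embedding dimension, head count, and the `CrossAttention` class itself are illustrative, not from the paper.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """One branch's CLS token attends over the other branch's patch tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, cls_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # cls_a: (B, 1, D) query; tokens_b: (B, N, D) keys/values from the other branch.
        q = self.norm_q(cls_a)
        kv = self.norm_kv(tokens_b)
        fused, _ = self.attn(q, kv, kv)
        return cls_a + fused  # residual update of the CLS token


# Usage: symmetric exchange between the CNN branch and the texture branch.
cnn_cls, cnn_tokens = torch.randn(2, 1, 256), torch.randn(2, 49, 256)
tex_cls, tex_tokens = torch.randn(2, 1, 256), torch.randn(2, 49, 256)
xattn = CrossAttention()
cnn_cls = xattn(cnn_cls, tex_tokens)   # CNN CLS attends to texture tokens
tex_cls = xattn(tex_cls, cnn_tokens)   # texture CLS attends to CNN tokens
```

The symmetric update lets each branch inject its evidence into the other's summary token before classification, which is one plausible reading of the "dual branch's cross-attention mechanism" the abstract names.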