Deepfake audio detection with spectral features and ResNeXt-based architecture

IF 7.2 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Gul Tahaoglu, Daniele Baracchi, Dasara Shullani, Massimo Iuliani, Alessandro Piva
{"title":"Deepfake audio detection with spectral features and ResNeXt-based architecture","authors":"Gul Tahaoglu ,&nbsp;Daniele Baracchi ,&nbsp;Dasara Shullani ,&nbsp;Massimo Iuliani ,&nbsp;Alessandro Piva","doi":"10.1016/j.knosys.2025.113726","DOIUrl":null,"url":null,"abstract":"<div><div>The increasing prevalence of deepfake audio technologies and their potential for malicious use in fields such as politics and media has raised significant concerns regarding the ability to distinguish fake from authentic audio recordings. This study proposes a robust technique for detecting synthetic audio by leveraging three spectral features: Linear Frequency Cepstral Coefficients (LFCC), Mel Frequency Cepstral Coefficients (MFCC), and Constant Q Cepstral Coefficients (CQCC). These features are processed using an enhanced ResNeXt architecture to improve classification accuracy between genuine and spoofed audio. Additionally, a Multi-Layer Perceptron (MLP)-based fusion technique is employed to further boost the model’s performance. Extensive experiments were conducted using three datasets: the ASVspoof 2019 Logical Access (LA) dataset—featuring text-to-speech (TTS) and voice conversion attacks—the ASVspoof 2019 Physical Access (PA) dataset—including replay attacks—and the ASVspoof 2021 LA, PA and DF datasets. The proposed approach has demonstrated superior performance compared to state-of-the-art methods across all three datasets, particularly in detecting fake audio generated by text-to-speech (TTS) attacks. Its overall performance is summarized as follows: the system achieved an Equal Error Rate (EER) of 1.05% and a minimum tandem Detection Cost Function (min-tDCF) of 0.028 on the ASVspoof 2019 Logical Access (LA) dataset, and an EER of 1.14% and min-tDCF of 0.03 on the ASVspoof 2019 Physical Access(PA) dataset, demonstrating its robustness in detecting various types of audio spoofing attacks. Finally, on the ASVspoof 2021 LA dataset the method achieved an EER of 7.44% and min-tDCF of 0.35.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"323 ","pages":"Article 113726"},"PeriodicalIF":7.2000,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125007725","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The increasing prevalence of deepfake audio technologies and their potential for malicious use in fields such as politics and media have raised significant concerns regarding the ability to distinguish fake from authentic audio recordings. This study proposes a robust technique for detecting synthetic audio by leveraging three spectral features: Linear Frequency Cepstral Coefficients (LFCC), Mel Frequency Cepstral Coefficients (MFCC), and Constant Q Cepstral Coefficients (CQCC). These features are processed using an enhanced ResNeXt architecture to improve classification accuracy between genuine and spoofed audio. Additionally, a Multi-Layer Perceptron (MLP)-based fusion technique is employed to further boost the model’s performance. Extensive experiments were conducted using three datasets: the ASVspoof 2019 Logical Access (LA) dataset—featuring text-to-speech (TTS) and voice conversion attacks—the ASVspoof 2019 Physical Access (PA) dataset—including replay attacks—and the ASVspoof 2021 LA, PA and DF datasets. The proposed approach has demonstrated superior performance compared to state-of-the-art methods across all three datasets, particularly in detecting fake audio generated by text-to-speech (TTS) attacks. Its overall performance is summarized as follows: the system achieved an Equal Error Rate (EER) of 1.05% and a minimum tandem Detection Cost Function (min-tDCF) of 0.028 on the ASVspoof 2019 Logical Access (LA) dataset, and an EER of 1.14% and min-tDCF of 0.03 on the ASVspoof 2019 Physical Access (PA) dataset, demonstrating its robustness in detecting various types of audio spoofing attacks. Finally, on the ASVspoof 2021 LA dataset the method achieved an EER of 7.44% and min-tDCF of 0.35.
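The abstract describes a multi-branch pipeline: cepstral features (LFCC, MFCC, CQCC) are extracted per utterance, each feature type is classified by an enhanced ResNeXt-style network, and an MLP fuses the per-branch scores. The sketch below, in Python with librosa and PyTorch, illustrates the general shape of such a system under stated assumptions only: MFCC extraction via librosa (LFCC and CQCC follow the same filter-bank → log → DCT recipe with a linear or constant-Q front end), a small grouped-convolution branch standing in for the ResNeXt backbone, and an MLP fusion head. All layer sizes, hop lengths, and the fusion topology are illustrative placeholders, not the authors' configuration.

```python
# Minimal sketch of a spectral-feature + branch-classifier + MLP-fusion detector.
# Hyperparameters and network shapes are placeholders, not the paper's setup.
import numpy as np
import librosa
import torch
import torch.nn as nn


def mfcc_features(path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Return an (n_mfcc, frames) MFCC matrix for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)


class BranchClassifier(nn.Module):
    """Stand-in for one ResNeXt-style branch: maps a 2-D cepstrogram
    (treated as a 1-channel image) to a single spoof/bona-fide logit."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            # Grouped convolution: the cardinality idea behind ResNeXt blocks.
            nn.Conv2d(32, 64, kernel_size=3, padding=1, groups=8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.head(h)


class MLPFusion(nn.Module):
    """Fuses the logits of the LFCC/MFCC/CQCC branches into one score."""

    def __init__(self, n_branches: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_branches, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, branch_logits: torch.Tensor) -> torch.Tensor:
        return self.net(branch_logits)


if __name__ == "__main__":
    # Dummy batch: 4 utterances, 3 feature types, 20 coefficients x 300 frames.
    branches = [BranchClassifier() for _ in range(3)]
    fusion = MLPFusion(n_branches=3)
    feats = [torch.randn(4, 1, 20, 300) for _ in range(3)]
    logits = torch.cat([b(f) for b, f in zip(branches, feats)], dim=1)  # (4, 3)
    print(fusion(logits).shape)  # (4, 1) fused spoof scores
```

In this sketch the fusion operates on branch logits; score-level fusion of independently trained branches is one common design, though the paper's MLP fusion may instead combine intermediate embeddings.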
Source journal
Knowledge-Based Systems (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.