DOI: 10.1016/j.knosys.2025.113726
Journal: Knowledge-Based Systems, Volume 323, Article 113726 (Q1, Computer Science, Artificial Intelligence; Impact Factor 7.2)
Published: 2025-05-26 (Journal Article)
Authors: Gul Tahaoglu, Daniele Baracchi, Dasara Shullani, Massimo Iuliani, Alessandro Piva
Source: https://www.sciencedirect.com/science/article/pii/S0950705125007725
Deepfake audio detection with spectral features and ResNeXt-based architecture
The increasing prevalence of deepfake audio technologies and their potential for malicious use in fields such as politics and media have raised significant concerns regarding the ability to distinguish fake from authentic audio recordings. This study proposes a robust technique for detecting synthetic audio by leveraging three spectral features: Linear Frequency Cepstral Coefficients (LFCC), Mel Frequency Cepstral Coefficients (MFCC), and Constant Q Cepstral Coefficients (CQCC). These features are processed using an enhanced ResNeXt architecture to improve classification accuracy between genuine and spoofed audio. Additionally, a Multi-Layer Perceptron (MLP)-based fusion technique is employed to further boost the model’s performance. Extensive experiments were conducted on the ASVspoof 2019 Logical Access (LA) dataset, which features text-to-speech (TTS) and voice conversion attacks; the ASVspoof 2019 Physical Access (PA) dataset, which includes replay attacks; and the ASVspoof 2021 LA, PA, and DF datasets. The proposed approach has demonstrated superior performance compared to state-of-the-art methods across these datasets, particularly in detecting fake audio generated by TTS attacks. Its overall performance is summarized as follows: the system achieved an Equal Error Rate (EER) of 1.05% and a minimum tandem Detection Cost Function (min-tDCF) of 0.028 on the ASVspoof 2019 LA dataset, and an EER of 1.14% and min-tDCF of 0.03 on the ASVspoof 2019 PA dataset, demonstrating its robustness in detecting various types of audio spoofing attacks. Finally, on the ASVspoof 2021 LA dataset the method achieved an EER of 7.44% and min-tDCF of 0.35.
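The three cepstral front-ends named in the abstract share the same pipeline: frame the waveform, take a power spectrum, apply a filterbank, take logs, and decorrelate with a DCT; they differ mainly in the filterbank's frequency scale (linear for LFCC, mel for MFCC, constant-Q for CQCC). Below is a minimal NumPy/SciPy sketch of LFCC extraction to illustrate that pipeline. It is not the authors' implementation; all parameter values (sample rate, FFT size, filter and coefficient counts) are illustrative defaults, and swapping the linear filterbank for a mel or constant-Q one would yield MFCC- or CQCC-style features.

```python
import numpy as np
from scipy.fftpack import dct

def linear_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced linearly from 0 to sr/2.
    (LFCC uses a linear frequency scale; MFCC would use the mel scale here.)"""
    edges = np.linspace(0, sr / 2, n_filters + 2)          # filter edge frequencies
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)  # corresponding FFT bins
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, center, hi = bins[i], bins[i + 1], bins[i + 2]
        if center > lo:
            fb[i, lo:center] = (np.arange(lo, center) - lo) / (center - lo)  # rising edge
        if hi > center:
            fb[i, center:hi] = (hi - np.arange(center, hi)) / (hi - center)  # falling edge
    return fb

def lfcc(signal, sr=16000, n_fft=512, hop=256, n_filters=20, n_ceps=13):
    """Frame -> power spectrum -> linear filterbank -> log -> DCT-II."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # (n_frames, n_fft//2+1)
    fb = linear_filterbank(n_filters, n_fft, sr)
    log_energies = np.log(power @ fb.T + 1e-10)                # avoid log(0)
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

A one-second 440 Hz tone at 16 kHz, for example, yields a `(61, 13)` feature matrix with these defaults; in the paper's setting such matrices would be the per-branch inputs to the ResNeXt classifier before MLP-based score fusion.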
Journal overview:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.