Spoof speech classification using deep speaker embeddings and machine learning models

Impact factor: 4.5 | JCR Q2, Computer Science, Theory & Methods

Array | Pub Date: 2025-08-21 | DOI: 10.1016/j.array.2025.100494

Mohammed Hamzah Alsalihi, Dávid Sztahó

Array, volume 27, Article 100494. Journal article. https://www.sciencedirect.com/science/article/pii/S2590005625001213

Citations: 0

Abstract

This paper examines the effectiveness of deep speaker embeddings combined with machine learning classifiers for spoof speech detection. We leverage four state-of-the-art speaker embedding models: X-vector, Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Network-Time Delay Neural Network (ResNet-TDNN), and WavLM, used in both pre-trained and fine-tuned forms, to extract speaker-discriminative features from speech signals. These embeddings are passed to five classifiers (Support Vector Machine, Random Forest, Multi-Layer Perceptron, Logistic Regression, and XGBoost) to classify whether a speech sample is a deepfake. We apply multiple feature scaling strategies and assess performance using standard metrics as well as the receiver operating characteristic (ROC) curve. Our results show that fine-tuned ECAPA-TDNN embeddings consistently outperform the other embeddings across classifiers. This work contributes a robust pipeline for automated spoof speech classification, which can serve as a critical preprocessing step for other systems such as forensic voice comparison.
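The abstract describes a pipeline of embedding extraction, feature scaling, classification, and ROC-based evaluation. The sketch below illustrates only the last three stages and is not the authors' implementation. It assumes synthetic Gaussian vectors as stand-ins for speaker embeddings, z-score scaling as one possible scaling strategy, and a small hand-written logistic regression (one of the five classifier families named above). All function names, dimensions, and hyperparameters are hypothetical choices made for illustration.

```python
import math
import random


def fit_standardizer(rows):
    """Return per-dimension mean and standard deviation of the training rows."""
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / len(rows)
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for zero variance
    return means, stds


def standardize(rows, means, stds):
    """Apply z-score scaling using statistics fitted on the training set only."""
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(rows, labels, lr=0.1, epochs=200):
    """Fit logistic regression with plain batch gradient descent."""
    dims, n = len(rows[0]), len(rows)
    w, b = [0.0] * dims, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dims, 0.0
        for r, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, r)) + b) - y
            for d in range(dims):
                grad_w[d] += err * r[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_scores(rows, w, b):
    """Return the predicted probability of the spoof class (label 1)."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, r)) + b) for r in rows]


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a spoof
    sample receives a higher score than a bona fide one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n_per_class, dims, shift, rng):
    """Synthetic stand-in for embeddings: label 0 = bona fide, 1 = spoof."""
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            rows.append([rng.gauss(label * shift, 1.0) for _ in range(dims)])
            labels.append(label)
    return rows, labels


rng = random.Random(0)
train_x, train_y = make_synthetic(100, 8, 1.0, rng)
test_x, test_y = make_synthetic(50, 8, 1.0, rng)

means, stds = fit_standardizer(train_x)
w, b = train_logreg(standardize(train_x, means, stds), train_y)
test_scores = predict_scores(standardize(test_x, means, stds), w, b)
auc = roc_auc(test_y, test_scores)
print(f"ROC AUC on synthetic data: {auc:.3f}")
```

In a real setting the synthetic rows would be replaced by embeddings extracted from audio with one of the models listed in the abstract. The scaler is fitted on the training split only so that test statistics do not leak into the classifier.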
