Spoof speech classification using deep speaker embeddings and machine learning models
Authors: Mohammed Hamzah Alsalihi, Dávid Sztahó
Journal: Array, volume 27, Article 100494
Published: 2025-08-21
DOI: 10.1016/j.array.2025.100494
URL: https://www.sciencedirect.com/science/article/pii/S2590005625001213
This paper examines the effectiveness of deep speaker embeddings combined with machine learning classifiers for spoof speech detection. We use four state-of-the-art speaker embedding models, namely X-vector, Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Network-Time Delay Neural Network (ResNet-TDNN), and WavLM, in both pre-trained and fine-tuned forms, to extract speaker-discriminative features from speech signals. These embeddings are fed to five classifiers (Support Vector Machine, Random Forest, Multi-Layer Perceptron, Logistic Regression, and XGBoost) to classify whether a speech sample is a deepfake. We apply multiple feature scaling strategies and assess performance using standard metrics as well as the receiver operating characteristic (ROC) curve. Our results show that fine-tuned ECAPA-TDNN embeddings consistently outperform the other embeddings across classifiers. This work contributes a robust pipeline for automated spoof speech classification that can serve as a critical preprocessing step for other systems such as forensic voice comparison.
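The scale, classify, and evaluate flow described in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and not the authors' method: the "embeddings" are synthetic Gaussian vectors, not outputs of X-vector, ECAPA-TDNN, ResNet-TDNN, or WavLM; z-score standardization is only one possible scaling strategy; and a nearest-centroid scorer stands in for the five classifiers named above. The function names (`standardize`, `centroid`, `spoof_score`, `roc_auc`, `demo`) are hypothetical.

```python
import math
import random


def standardize(train, test):
    """Z-score every feature using statistics from the training split only."""
    dims = len(train[0])
    means = [sum(r[d] for r in train) / len(train) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant features

    def scale(rows):
        return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]

    return scale(train), scale(test)


def centroid(rows):
    """Mean vector of a list of equally sized vectors."""
    dims = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]


def spoof_score(x, c_real, c_spoof):
    """Higher score means x lies closer to the spoof centroid than the bona fide one."""
    return math.dist(x, c_real) - math.dist(x, c_spoof)


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of (spoof, bona fide)
    pairs in which the spoof sample scores higher; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def demo(seed=0, dims=8, n_train=100, n_test=50):
    """Run the toy pipeline on synthetic data (label 1 = spoof, 0 = bona fide)."""
    rng = random.Random(seed)

    def sample(n, shift):
        # Synthetic stand-in for speaker embeddings: one Gaussian per class.
        return [[rng.gauss(shift, 1.0) for _ in range(dims)] for _ in range(n)]

    train_x = sample(n_train, 0.0) + sample(n_train, 0.8)
    train_y = [0] * n_train + [1] * n_train
    test_x = sample(n_test, 0.0) + sample(n_test, 0.8)
    test_y = [0] * n_test + [1] * n_test

    train_x, test_x = standardize(train_x, test_x)
    c_real = centroid([x for x, y in zip(train_x, train_y) if y == 0])
    c_spoof = centroid([x for x, y in zip(train_x, train_y) if y == 1])
    scores = [spoof_score(x, c_real, c_spoof) for x in test_x]
    return roc_auc(test_y, scores)


if __name__ == "__main__":
    print(f"Toy ROC AUC: {demo():.3f}")
```

Fitting the scaler on the training split alone keeps test statistics out of the model, which matters when comparing scaling strategies. In a real experiment the synthetic sampler would be replaced by embeddings extracted from audio, and the centroid scorer by one of the classifiers listed in the abstract.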