Leveraging distance information for generalized spoofing speech detection
Jingze Lu, Yuxiang Zhang, Zhuo Li, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang
Computer Speech and Language, Volume 94, Article 101804
DOI: 10.1016/j.csl.2025.101804
Published: 2025-04-20
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000294
Citations: 0
Abstract
Spoofing speech detection (SSD) systems suffer from insufficient generalization to in-the-wild data, including unseen attacks and bonafide speech from unseen distributions, which hampers their applicability in real-world scenarios. Such performance degradation can be attributed to an inherent flaw of deep neural network (DNN)-based models, namely overlearning the training data. Inter-instance distance, which is underutilized in conventional DNN-based classifiers, proves beneficial for handling unseen samples. Our experiments indicate that, in certain feature spaces, the distances between bonafide utterances are smaller than those between spoofed ones. Therefore, this paper proposes a distance-based method to enhance the generalization ability of anti-spoofing models. By incorporating distance features as a prefix, the proposed method achieves lightweight parameter updates while effectively detecting unseen attacks and bonafide utterances from unseen distributions. On the logical access (LA) tracks of ASVspoof 2019 and ASVspoof 2021, the proposed method achieves equal error rates (EERs) of 0.53% and 4.73%, respectively. Moreover, it achieves EERs of 1.86% and 7.30% on the ASVspoof 2021 Deepfake and In-the-Wild datasets, respectively, demonstrating its superior generalization ability. The proposed method outperforms other state-of-the-art (SOTA) methods on multiple datasets.
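To make the distance-as-prefix idea concrete, the sketch below illustrates one way such a module could look: distances from an utterance embedding to a small set of bonafide "anchor" embeddings are computed and projected into a prefix that is concatenated with the original embedding, so that only the small projection needs to be trained. This is a minimal illustration, not the authors' implementation; the class and parameter names (AnchorDistancePrefix, num_anchors, embed_dim) and the use of cosine distance are assumptions for demonstration only.

```python
# Hypothetical sketch of a distance-based prefix module (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnchorDistancePrefix(nn.Module):
    """Turns cosine distances to bonafide anchor embeddings into a prefix vector."""

    def __init__(self, anchors: torch.Tensor, embed_dim: int):
        super().__init__()
        # anchors: (num_anchors, embed_dim) embeddings of known bonafide speech.
        self.register_buffer("anchors", F.normalize(anchors, dim=-1))
        # Small trainable projection: only these parameters are updated (lightweight).
        self.proj = nn.Linear(anchors.size(0), embed_dim)

    def forward(self, utt_emb: torch.Tensor) -> torch.Tensor:
        # utt_emb: (batch, embed_dim) utterance embeddings from a frozen front end.
        sim = F.normalize(utt_emb, dim=-1) @ self.anchors.t()  # cosine similarity
        dist = 1.0 - sim                                       # cosine distance
        prefix = self.proj(dist)                               # (batch, embed_dim)
        # Prepend the distance-derived prefix to the original embedding.
        return torch.cat([prefix, utt_emb], dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    anchors = torch.randn(16, 256)      # 16 hypothetical bonafide anchor embeddings
    module = AnchorDistancePrefix(anchors, embed_dim=256)
    utterances = torch.randn(4, 256)    # a batch of 4 utterance embeddings
    print(module(utterances).shape)     # torch.Size([4, 512])
```

Intuitively, because bonafide utterances cluster more tightly than spoofed ones in certain feature spaces, the distance pattern to bonafide anchors carries information that a conventional classifier head would otherwise ignore.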
Journal introduction:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.