Spoofing countermeasure for fake speech detection using brute force features

IF 3.1 · CAS Region 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Arsalan Rahman Mirza, Abdulbasit K. Al-Talabani
{"title":"Spoofing countermeasure for fake speech detection using brute force features","authors":"Arsalan Rahman Mirza ,&nbsp;Abdulbasit K. Al-Talabani","doi":"10.1016/j.csl.2024.101732","DOIUrl":null,"url":null,"abstract":"<div><div>Due to the progress in deep learning technology, techniques that generate spoofed speech have significantly emerged. Such synthetic speech can be exploited for harmful purposes, like impersonation or disseminating false information. Researchers in the area investigate the useful features for spoof detection. This paper extensively investigates three problems in spoof detection in speech, namely, the imbalanced sample per class, which may negatively affect the performance of any detection models, the effect of the feature early and late fusion, and the analysis of unseen attacks on the model. Regarding the imbalanced issue, we have proposed two approaches (a Synthetic Minority Over Sampling Technique (SMOTE)-based and a Bootstrap-based model). We have used the OpenSMILE toolkit, to extract different feature sets, their results and early and late fusion of them have been investigated. The experiments are evaluated using the ASVspoof 2019 datasets which encompass synthetic, voice-conversion, and replayed speech samples. Additionally, Support Vector Machine (SVM) and Deep Neural Network (DNN) have been adopted in the classification. The outcomes from various test scenarios indicated that neither the imbalanced nature of the dataset nor a specific feature or their fusions outperformed the brute force version of the model as the best Equal Error Rate (EER) achieved by the Imbalance model is 6.67 % and 1.80 % for both Logical Access (LA) and Physical Access (PA) respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824001153","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Due to progress in deep learning, techniques that generate spoofed speech have emerged rapidly. Such synthetic speech can be exploited for harmful purposes, such as impersonation or the dissemination of false information, so researchers in the area investigate which features are useful for spoof detection. This paper extensively investigates three problems in speech spoof detection: the imbalanced number of samples per class, which may negatively affect the performance of any detection model; the effect of early and late feature fusion; and the analysis of unseen attacks on the model. To address the imbalance issue, we propose two approaches, a Synthetic Minority Over-sampling Technique (SMOTE)-based model and a Bootstrap-based model. We use the OpenSMILE toolkit to extract different feature sets and investigate their individual results as well as their early and late fusion. The experiments are evaluated on the ASVspoof 2019 datasets, which encompass synthetic, voice-conversion, and replayed speech samples, and both a Support Vector Machine (SVM) and a Deep Neural Network (DNN) are adopted as classifiers. The outcomes from the various test scenarios indicate that neither rebalancing the dataset nor any specific feature or fusion of features outperformed the brute force version of the model; the best Equal Error Rates (EER), achieved by the imbalanced model, are 6.67 % for Logical Access (LA) and 1.80 % for Physical Access (PA).
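To make the described pipeline concrete, below is a minimal sketch of one configuration named in the abstract (OpenSMILE functionals, SMOTE balancing, an SVM back-end, and EER scoring), assuming the opensmile, imbalanced-learn, and scikit-learn Python packages. The wav-path lists, the 0/1 labels, and the choice of the ComParE 2016 feature set are illustrative assumptions, not the authors' exact ASVspoof 2019 setup.

```python
# Minimal sketch of a SMOTE + OpenSMILE + SVM spoofing countermeasure.
# Assumes: pip install opensmile imbalanced-learn scikit-learn
# Inputs (wav path lists and labels) are hypothetical placeholders.
import numpy as np
import opensmile
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_curve
from sklearn.svm import SVC


def extract_features(wav_paths):
    """One OpenSMILE functionals vector per utterance (ComParE 2016 set used as an example)."""
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return np.vstack([smile.process_file(p).to_numpy().ravel() for p in wav_paths])


def equal_error_rate(y_true, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0


def run_smote_svm(train_wavs, y_train, eval_wavs, y_eval):
    """Train an SVM countermeasure on SMOTE-balanced features and return the eval EER."""
    X_train = extract_features(train_wavs)
    X_eval = extract_features(eval_wavs)

    # SMOTE-based oversampling of the minority class (bonafide in ASVspoof 2019)
    # so both classes contribute equally during training.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, np.asarray(y_train))

    clf = SVC(kernel="rbf", gamma="scale")  # SVM back-end, as named in the abstract
    clf.fit(X_bal, y_bal)

    scores = clf.decision_function(X_eval)  # higher score = more bonafide-like
    return equal_error_rate(np.asarray(y_eval), scores)
```

A bootstrap-based variant would instead resample the training set with replacement rather than generating synthetic minority samples; early fusion would concatenate several OpenSMILE feature sets before training, while late fusion would combine the per-feature-set scores (for example by averaging) before computing the EER.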
Source Journal

Computer Speech and Language (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.