Towards explainable spoofed speech attribution and detection: A probabilistic approach for characterizing speech synthesizer components

IF 3.4 · Zone 3 (Computer Science) · JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2025-06-11 DOI:10.1016/j.csl.2025.101840

Jagabandhu Mishra, Manasi Chhibber, Hye-jin Shim, Tomi H. Kinnunen

Computer Speech and Language, vol. 95, Article 101840. https://www.sciencedirect.com/science/article/pii/S0885230825000658


Abstract

We propose an explainable probabilistic framework that characterizes spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure (CM) embeddings, which lack interpretability, the proposed embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding values. We use these embeddings with four classifier back-ends to address two downstream tasks: spoofing detection and spoofing attack attribution. The former is the well-known bona fide vs. spoof detection task, whereas the latter seeks to identify the source method (generator) of a spoofed utterance. We additionally use Shapley values, a widely used technique in machine learning, to quantify the relative contribution of each attribute value to the decision in each task. Results on the ASVspoof2019 dataset demonstrate the substantial role of the waveform generator, conversion model outputs, and inputs in spoofing detection, and of inputs, speaker, and duration modeling in spoofing attack attribution. In the detection task, the probabilistic attribute embeddings achieve 99.7% balanced accuracy and 0.22% equal error rate (EER), closely matching the performance of raw embeddings (99.9% balanced accuracy and 0.22% EER). Similarly, in the attribution task, our embeddings achieve 90.23% balanced accuracy and 2.07% EER, compared to 90.16% and 2.11% with raw embeddings. These results demonstrate that the proposed framework is both explainable by design and capable of achieving performance comparable to raw CM embeddings.
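The idea of a probabilistic attribute embedding can be illustrated with a minimal sketch. It assumes, purely for illustration, that each high-level attribute has its own classifier producing logits over that attribute's values, and that the embedding is the concatenation of the resulting posteriors. The attribute names follow the abstract; the value names (`neural_vocoder`, `attention`, and so on), the numbers, and the functions `softmax` and `attribute_embedding` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_embedding(logits_by_attribute):
    """Concatenate per-attribute posteriors into one interpretable vector.

    Returns (embedding, names), where names[i] labels embedding[i]
    as 'attribute=value'. This is an assumed construction, not the
    authors' confirmed implementation.
    """
    embedding, names = [], []
    for attribute, value_logits in logits_by_attribute.items():
        values = list(value_logits)
        probs = softmax([value_logits[v] for v in values])
        embedding.extend(probs)
        names.extend(f"{attribute}={v}" for v in values)
    return embedding, names


# Hypothetical logits from two attribute classifiers (made-up numbers).
example_logits = {
    "waveform_generator": {
        "neural_vocoder": 2.0,
        "classical_vocoder": 0.0,
        "none": -1.0,
    },
    "duration_model": {"attention": 0.5, "hmm": 0.5},
}

embedding, names = attribute_embedding(example_logits)
for name, p in zip(names, embedding):
    print(f"{name}: {p:.3f}")
```

Because every dimension is the probability of a named attribute value, a back-end classifier trained on such a vector can be inspected dimension by dimension, which is what makes an attribution method such as Shapley values directly readable in terms of synthesizer components.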


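Source journal

Computer Speech and Language (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time

22.9 weeks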

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.