{"title":"SpoTNet: A spoofing-aware Transformer Network for Effective Synthetic Speech Detection","authors":"Awais Khan, K. Malik","doi":"10.1145/3592572.3592841","DOIUrl":null,"url":null,"abstract":"The prevalence of voice spoofing attacks in today’s digital world has become a critical security concern. Attackers employ various techniques, such as voice conversion (VC) and text-to-speech (TTS), to generate synthetic speech that imitates the victim’s voice and gain access to sensitive information. The recent advances in synthetic speech generation pose a significant threat to modern security systems, while traditional voice authentication methods are incapable of detecting them effectively. To address this issue, a novel solution for logical access (LA)-based synthetic speech detection is proposed in this paper. SpoTNet is an attention-based spoofing transformer network that includes crafted front-end spoofing features and deep attentive features retrieved using the developed logical spoofing transformer encoder (LSTE). The derived attentive features were then processed by the proposed multi-layer spoofing classifier to classify speech samples as bona fide or synthetic. In synthetic speeches produced by the TTS algorithm, the spectral characteristics of the synthetic speech are altered to match the target speaker’s formant frequencies, while in VC attacks, the temporal alignment of the speech segments is manipulated to preserve the target speaker’s prosodic features. By highlighting these observations, this paper targets the prosodic and phonetic-based crafted features, i.e., the Mel-spectrogram, spectral contrast, and spectral envelope, presenting an effective preprocessing pipeline proven to be effective in synthetic speech detection. The proposed solution achieved state-of-the-art performance against eight recent feature fusion methods with lower EER of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.","PeriodicalId":239252,"journal":{"name":"Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3592572.3592841","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
The prevalence of voice spoofing attacks in today’s digital world has become a critical security concern. Attackers employ techniques such as voice conversion (VC) and text-to-speech (TTS) to generate synthetic speech that imitates a victim’s voice and gains access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, and traditional voice authentication methods cannot detect such attacks effectively. To address this issue, this paper proposes a novel solution for logical access (LA) synthetic speech detection. SpoTNet is an attention-based spoofing transformer network that combines crafted front-end spoofing features with deep attentive features retrieved by the developed logical spoofing transformer encoder (LSTE). The derived attentive features are then processed by the proposed multi-layer spoofing classifier to label speech samples as bona fide or synthetic. In synthetic speech produced by TTS algorithms, the spectral characteristics are altered to match the target speaker’s formant frequencies, whereas in VC attacks the temporal alignment of speech segments is manipulated to preserve the target speaker’s prosodic features. Building on these observations, this paper targets prosodic and phonetic crafted features, namely the Mel-spectrogram, spectral contrast, and spectral envelope, and presents a preprocessing pipeline shown to be effective for synthetic speech detection. The proposed solution achieved state-of-the-art performance against eight recent feature-fusion methods, with a lower equal error rate (EER) of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance speaker identification and improve speaker recognition systems.
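The three crafted features named in the abstract are all standard spectral descriptors, so a front-end along these lines can be sketched with common tooling. The snippet below is a minimal illustration using librosa; the frame parameters (sample rate, FFT size, hop, mel bands) and the cepstral-smoothing estimator used for the spectral envelope are assumptions for demonstration, not the authors’ actual preprocessing code.

```python
# Illustrative sketch of a crafted spoofing front-end: Mel-spectrogram,
# spectral contrast, and spectral envelope. Parameter values and the
# cepstral-smoothing envelope estimator are assumptions, since the
# paper's exact pipeline is not specified in the abstract.
import numpy as np
import librosa

def extract_crafted_features(path, sr=16000, n_fft=512, hop=160,
                             n_mels=80, lifter=30):
    y, _ = librosa.load(path, sr=sr)

    # Mel-spectrogram in dB: phonetic energy distribution over time.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Spectral contrast: per-band peak-to-valley energy difference.
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=n_fft,
                                                 hop_length=hop)

    # Spectral envelope via cepstral smoothing (assumed estimator):
    # keep only the low-quefrency cepstral coefficients of each frame.
    log_mag = np.log(np.abs(librosa.stft(y, n_fft=n_fft,
                                         hop_length=hop)) + 1e-10)
    ceps = np.fft.irfft(log_mag, n=n_fft, axis=0)        # real cepstrum per frame
    ceps[lifter:-lifter, :] = 0.0                        # lifter out fine structure
    envelope = np.fft.rfft(ceps, n=n_fft, axis=0).real   # smoothed log envelope

    return {"mel_db": mel_db, "contrast": contrast, "envelope": envelope}
```

The resulting matrices could then be stacked or concatenated along the feature axis to form the network input; the abstract does not specify SpoTNet’s fusion scheme, so that step is left out here.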