Priyanka Gupta, Piyushkumar K. Chodingala, H. Patil
Title: Morse Wavelet Features for Pop Noise Detection
Venue: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM)
Publication date: 2022-07-11
DOI: 10.1109/SPCOM55316.2022.9840840 (https://doi.org/10.1109/SPCOM55316.2022.9840840)
Citations: 3
Abstract
Spoofed Speech Detection (SSD) has been an important problem, especially for Automatic Speaker Verification (ASV) systems. However, the techniques used to design countermeasure systems for the SSD task are attack-specific, so the resulting solutions are far from a generalized SSD system that can detect any type of spoofed speech. Voice Liveness Detection (VLD) systems, on the other hand, rely on a characteristic of live speech (i.e., pop noise) to detect whether an utterance is live or not. Given that the attacker is free to mount any type of attack, VLD systems play a crucial role in defending against spoofing attacks, irrespective of the type of spoof used. To that effect, we propose Generalized Morse Wavelet (GMW)-based features for VLD, with a Convolutional Neural Network (CNN) as the back-end classifier. In this context, we use pop noise as a discriminative acoustic cue to detect live speech. Pop noise is present in live speech signals at low frequencies (typically $\leq 40$ Hz) and is caused by human breath reaching a closely placed microphone. We show that for $\gamma = 3$, the Morse wavelet has the highest concentration of information, as indicated by the smallest area of the Heisenberg box. Hence, we take $\gamma = 3$ for our experiments on Morse wavelets. We compare the performance of our system with the original Short-Time Fourier Transform (STFT)-Support Vector Machine (SVM) baseline and with other existing systems, namely Constant Q-Transform (CQT)-SVM, STFT-CNN, and bump wavelet-CNN. With an overall accuracy of 86.90% on the evaluation set, our proposed system significantly outperforms the original STFT-SVM baseline, CQT-SVM, STFT-CNN, and bump wavelet-CNN by absolute margins of 18.97%, 8.02%, 15.09%, and 12.21%, respectively. Finally, we also analyze the effect of various phoneme types on VLD system performance.
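
To make the feature extraction idea more concrete, the following is a minimal sketch of a generalized Morse wavelet scalogram restricted to low frequencies, where the abstract places pop noise. It is not the authors' implementation. It uses the standard frequency-domain form of the generalized Morse wavelet, $\Psi_{\beta,\gamma}(\omega) = a_{\beta,\gamma}\,\omega^{\beta} e^{-\omega^{\gamma}}$ for $\omega > 0$ (zero otherwise), with $a_{\beta,\gamma} = 2(e\gamma/\beta)^{\beta/\gamma}$ and peak frequency $(\beta/\gamma)^{1/\gamma}$. Only $\gamma = 3$ comes from the abstract; the value $\beta = 20$, the function names, the sampling rate, and the analysis frequencies are all assumptions chosen for illustration.

```python
import numpy as np

def morse_wavelet_freq(omega, gamma=3.0, beta=20.0):
    """Generalized Morse wavelet in the frequency domain.

    gamma=3 follows the abstract; beta=20 is an assumed value.
    The wavelet is analytic: it is zero for omega <= 0.
    """
    omega = np.asarray(omega, dtype=float)
    # Normalisation that makes the peak value of the wavelet equal to 2.
    a = 2.0 * (np.e * gamma / beta) ** (beta / gamma)
    psi = np.zeros_like(omega)
    pos = omega > 0
    psi[pos] = a * omega[pos] ** beta * np.exp(-omega[pos] ** gamma)
    return psi

def morse_peak_frequency(gamma=3.0, beta=20.0):
    """Radian frequency at which the unscaled wavelet attains its maximum."""
    return (beta / gamma) ** (1.0 / gamma)

def morse_cwt(x, fs, freqs_hz, gamma=3.0, beta=20.0):
    """FFT-based continuous wavelet transform (hypothetical helper).

    Returns a complex array of shape (len(freqs_hz), len(x)); row i is the
    signal filtered by the wavelet whose peak sits at freqs_hz[i].
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    X = np.fft.fft(x)
    omega = 2.0 * np.pi * np.fft.fftfreq(n)  # radians per sample
    peak = morse_peak_frequency(gamma, beta)
    out = np.empty((len(freqs_hz), n), dtype=complex)
    for i, f in enumerate(freqs_hz):
        # Scale chosen so that the dilated wavelet peaks at f Hz.
        scale = peak / (2.0 * np.pi * f / fs)
        out[i] = np.fft.ifft(X * morse_wavelet_freq(scale * omega, gamma, beta))
    return out

if __name__ == "__main__":
    fs = 1000                             # assumed sampling rate in Hz
    t = np.arange(fs) / fs                # one second of samples
    x = np.sin(2.0 * np.pi * 20.0 * t)    # synthetic 20 Hz tone, not real speech
    freqs = [10.0, 20.0, 40.0, 80.0]      # assumed analysis frequencies in Hz
    energy = (np.abs(morse_cwt(x, fs, freqs)) ** 2).mean(axis=1)
    print(dict(zip(freqs, energy)))
```

In a VLD pipeline of the kind the abstract describes, the magnitude of such a scalogram over the low-frequency band would be the image-like input handed to a CNN. The network architecture and the exact frequency grid are not given in the abstract, so they are left out here.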