{"title":"用于欺骗性语音检测的互补区域能量特征","authors":"Gökay Dişken","doi":"10.1016/j.csl.2023.101602","DOIUrl":null,"url":null,"abstract":"<div><p><span><span>Automatic speaker verification systems are found to be vulnerable to spoof attacks such as voice conversion, text-to-speech, and replayed speech. As the security of </span>biometric<span> systems is vital, many countermeasures have been developed for spoofed speech detection. To satisfy the recent developments on </span></span>speech synthesis<span>, publicly available datasets became more and more challenging (e.g., ASVspoof 2019 and 2021 datasets). A variety of replay attack configurations were also considered in those datasets, as they do not require expertise, hence easily performed. This work utilizes regional energy features, which are experimentally proven to be more effective than the traditional frame-based energy features. The proposed energy features are independent from the utterance length and are extracted over nonoverlapping time-frequency regions of the magnitude spectrum. Different configurations are considered in the experiments to verify the regional energy features’ contribution to the performance. First, light convolutional neural network<span> – long short-term memory (LCNN – LSTM) model with linear frequency cepstral coefficients<span> is used to determine the optimal number of regional energy features. Then, SE-Res2Net model with log power spectrogram features is used, which achieved comparable results to the state-of-the-art for ASVspoof 2019 logical access condition. Physical access condition from ASVspoof 2019 dataset, logical access and deep fake conditions from ASVspoof 2021 dataset are also used in the experiments. The regional energy features achieved improvements for all conditions with almost no additional computational or memory loads (less than 1% increase in the model size for SE-Res2Net). The main advantages of the regional energy features can be summarized as i) capturing nonspeech segments, ii) extracting band-limited information. Both aspects are found to be discriminative for spoofed speech detection.</span></span></span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Complementary regional energy features for spoofed speech detection\",\"authors\":\"Gökay Dişken\",\"doi\":\"10.1016/j.csl.2023.101602\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p><span><span>Automatic speaker verification systems are found to be vulnerable to spoof attacks such as voice conversion, text-to-speech, and replayed speech. As the security of </span>biometric<span> systems is vital, many countermeasures have been developed for spoofed speech detection. To satisfy the recent developments on </span></span>speech synthesis<span>, publicly available datasets became more and more challenging (e.g., ASVspoof 2019 and 2021 datasets). A variety of replay attack configurations were also considered in those datasets, as they do not require expertise, hence easily performed. This work utilizes regional energy features, which are experimentally proven to be more effective than the traditional frame-based energy features. The proposed energy features are independent from the utterance length and are extracted over nonoverlapping time-frequency regions of the magnitude spectrum. Different configurations are considered in the experiments to verify the regional energy features’ contribution to the performance. First, light convolutional neural network<span> – long short-term memory (LCNN – LSTM) model with linear frequency cepstral coefficients<span> is used to determine the optimal number of regional energy features. Then, SE-Res2Net model with log power spectrogram features is used, which achieved comparable results to the state-of-the-art for ASVspoof 2019 logical access condition. Physical access condition from ASVspoof 2019 dataset, logical access and deep fake conditions from ASVspoof 2021 dataset are also used in the experiments. The regional energy features achieved improvements for all conditions with almost no additional computational or memory loads (less than 1% increase in the model size for SE-Res2Net). The main advantages of the regional energy features can be summarized as i) capturing nonspeech segments, ii) extracting band-limited information. Both aspects are found to be discriminative for spoofed speech detection.</span></span></span></p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2023-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230823001213\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230823001213","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Complementary regional energy features for spoofed speech detection
Automatic speaker verification systems are found to be vulnerable to spoof attacks such as voice conversion, text-to-speech, and replayed speech. As the security of biometric systems is vital, many countermeasures have been developed for spoofed speech detection. To satisfy the recent developments on speech synthesis, publicly available datasets became more and more challenging (e.g., ASVspoof 2019 and 2021 datasets). A variety of replay attack configurations were also considered in those datasets, as they do not require expertise, hence easily performed. This work utilizes regional energy features, which are experimentally proven to be more effective than the traditional frame-based energy features. The proposed energy features are independent from the utterance length and are extracted over nonoverlapping time-frequency regions of the magnitude spectrum. Different configurations are considered in the experiments to verify the regional energy features’ contribution to the performance. First, light convolutional neural network – long short-term memory (LCNN – LSTM) model with linear frequency cepstral coefficients is used to determine the optimal number of regional energy features. Then, SE-Res2Net model with log power spectrogram features is used, which achieved comparable results to the state-of-the-art for ASVspoof 2019 logical access condition. Physical access condition from ASVspoof 2019 dataset, logical access and deep fake conditions from ASVspoof 2021 dataset are also used in the experiments. The regional energy features achieved improvements for all conditions with almost no additional computational or memory loads (less than 1% increase in the model size for SE-Res2Net). The main advantages of the regional energy features can be summarized as i) capturing nonspeech segments, ii) extracting band-limited information. Both aspects are found to be discriminative for spoofed speech detection.
期刊介绍:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.