Xiaoping Xie, Can Li, Dan Tian, Rufeng Shen, Fei Ding
Journal: Computer Speech and Language. Published: 2023-12-05. DOI: 10.1016/j.csl.2023.101601. URL: https://www.sciencedirect.com/science/article/pii/S0885230823001201
New research on monaural speech segregation based on quality assessment
Speech enhancement (SE) is a pivotal technology for improving the quality and intelligibility of speech signals. Nevertheless, when processing speech signals at high signal-to-noise ratio (SNR), conventional SE techniques may inadvertently reduce the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) scores. This article incorporates the Non-Intrusive Speech Quality Assessment (NISQA) algorithm into SE systems. By comparing pre- and post-enhancement speech quality scores, the system determines whether the speech signal under consideration warrants enhancement processing, thereby mitigating potential deterioration in PESQ and STOI. Furthermore, this study examines the effects of five prevalent speech features, namely Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Relative Spectral Transformed Perceptual Linear Prediction coefficients (RASTA-PLP), Amplitude Modulation Spectrogram (AMS), and Multi-Resolution Cochleagram (MRCG), on PESQ and STOI under varying noise conditions. Experimental results show that MRCG consistently emerges as the optimal and most stable feature for STOI, while the feature yielding the highest PESQ score depends in a complex way on the background noise type, the SNR level, and the compatibility of the noise with the speech signal. Consequently, we propose an SE methodology founded on quality assessment and feature selection, which adaptively selects the optimal features for distinct background noise scenarios, thereby consistently maintaining the highest-caliber enhancement effect with regard to PESQ.
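The quality-gating idea in the abstract (enhance only when the non-intrusive quality score improves) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `quality_gated_enhance`, `toy_enhance`, `toy_score`, and the `margin` parameter are hypothetical names introduced here. In a real system the scorer would be a NISQA model and the enhancer a trained SE network; both are replaced below by toy stand-ins.

```python
from typing import Callable, List, Tuple

Signal = List[float]


def quality_gated_enhance(
    noisy: Signal,
    enhance: Callable[[Signal], Signal],
    score_quality: Callable[[Signal], float],
    margin: float = 0.0,
) -> Tuple[Signal, bool]:
    """Return (output signal, whether the enhanced version was kept).

    The enhanced signal is kept only if its non-intrusive quality score
    exceeds the score of the unprocessed input by more than `margin`
    (an assumed tuning parameter, not taken from the paper).
    """
    enhanced = enhance(noisy)
    score_before = score_quality(noisy)
    score_after = score_quality(enhanced)
    if score_after > score_before + margin:
        return enhanced, True
    # Enhancement did not help: pass the original signal through unchanged.
    return noisy, False


def toy_enhance(signal: Signal) -> Signal:
    """Hypothetical stand-in for an SE model: halves every sample."""
    return [0.5 * s for s in signal]


def toy_score(signal: Signal) -> float:
    """Hypothetical stand-in for a non-intrusive quality score.

    Higher is better; the score peaks when the mean absolute amplitude is 1.0.
    """
    mean_abs = sum(abs(s) for s in signal) / len(signal)
    return -abs(mean_abs - 1.0)


if __name__ == "__main__":
    for name, sig in [("degraded", [2.0, -2.0]), ("already good", [1.0, -1.0])]:
        out, applied = quality_gated_enhance(sig, toy_enhance, toy_score)
        print(name, "-> enhancement applied:", applied, "output:", out)
```

Because the gate only compares two scores from the same scorer, it needs no clean reference signal, which is the reason a non-intrusive metric fits this role where intrusive metrics such as PESQ and STOI do not.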
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.