New research on monaural speech segregation based on quality assessment

IF 3.1 | Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language | Pub Date: 2023-12-05 | DOI: 10.1016/j.csl.2023.101601

Xiaoping Xie, Can Li, Dan Tian, Rufeng Shen, Fei Ding

Article URL: https://www.sciencedirect.com/science/article/pii/S0885230823001201

Citations: 0

Abstract



Speech enhancement (SE) is a pivotal technology for improving the quality and intelligibility of speech signals. Nevertheless, when processing speech signals at a high signal-to-noise ratio (SNR), conventional SE techniques may inadvertently reduce the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) scores. This article introduces the incorporation of the Non-Intrusive Speech Quality Assessment (NISQA) algorithm into SE systems. By comparing pre- and post-enhancement speech quality scores, it determines whether the speech signal under consideration warrants enhancement processing, thereby mitigating potential deterioration in PESQ and STOI. Furthermore, this study examines the effects of five prevalent speech features, namely Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Relative Spectral Transformed Perceptual Linear Prediction coefficients (RASTA-PLP), Amplitude Modulation Spectrogram (AMS), and Multi-Resolution Cochleagram (MRCG), on PESQ and STOI under varying noise conditions. Experimental results show that MRCG is consistently the best and most stable feature for STOI, while the feature yielding the highest PESQ score depends in a complex way on the background noise type, the SNR level, and the noise compatibility with the speech signal. Consequently, we propose an SE methodology founded on quality assessment and feature selection that adaptively selects the optimal features for distinct background noise scenarios, thereby consistently maintaining the best enhancement effect with regard to PESQ.
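The gating idea in the abstract (score the signal before and after enhancement, and keep the enhanced version only if the score improves) can be illustrated with a minimal sketch. This is not the authors' implementation. `score` is a hypothetical stand-in for a non-intrusive quality estimator such as NISQA, `enhancers` is a hypothetical mapping from a label to an SE model, and choosing among several candidates by highest predicted score is an assumption, since the abstract does not state how the feature is selected. The toy scorer and toy enhancers exist only so the sketch runs.

```python
from typing import Callable, Dict, List, Tuple

Signal = List[float]
Scorer = Callable[[Signal], float]     # hypothetical non-intrusive quality scorer
Enhancer = Callable[[Signal], Signal]  # hypothetical SE model for one feature type


def gated_enhance(noisy: Signal, enhancers: Dict[str, Enhancer],
                  score: Scorer) -> Tuple[str, Signal]:
    """Return (label, signal) for the candidate with the highest predicted quality.

    The unprocessed input competes under the label "none", so a signal that
    no enhancer improves is passed through unchanged.
    """
    best_label, best_signal, best_score = "none", noisy, score(noisy)
    for label, enhance in enhancers.items():
        candidate = enhance(noisy)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # strict: ties keep the earlier candidate
            best_label, best_signal, best_score = label, candidate, candidate_score
    return best_label, best_signal


# Toy stand-ins (assumptions, for illustration only).
def toy_score(x: Signal) -> float:
    """Smoother signals score higher: negative mean absolute sample difference."""
    return -sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)


def smooth(x: Signal) -> Signal:
    """Three-point moving average with zero padding at both ends."""
    padded = [0.0] + x + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(x))]


def double(x: Signal) -> Signal:
    """Amplify by two; never improves the toy score."""
    return [2 * v for v in x]


enhancers = {"smooth": smooth, "double": double}

rough = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
flat = [1.0] * 6
print(gated_enhance(rough, enhancers, toy_score)[0])  # smooth
print(gated_enhance(flat, enhancers, toy_score)[0])   # none
```

In the second call the input already has the best toy score, so the function returns it untouched, which mirrors the high-SNR case where enhancement would only degrade the signal.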


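Source journal

Computer Speech and Language (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Publication volume:

Review time: 22.9 weeks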

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.