混响和混响噪声环境中基于 DNN 的语音增强系统研究

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-06-06 DOI:10.1016/j.csl.2024.101677

Heming Wang , Ashutosh Pandey , DeLiang Wang

{"title":"混响和混响噪声环境中基于 DNN 的语音增强系统研究","authors":"Heming Wang , Ashutosh Pandey , DeLiang Wang","doi":"10.1016/j.csl.2024.101677","DOIUrl":null,"url":null,"abstract":"<div><p>Deep learning has led to dramatic performance improvements for the task of speech enhancement, where deep neural networks (DNNs) are trained to recover clean speech from noisy and reverberant mixtures. Most of the existing DNN-based algorithms operate in the frequency domain, as time-domain approaches are believed to be less effective for speech dereverberation. In this study, we employ two DNNs: ARN (attentive recurrent network) and DC-CRN (densely-connected convolutional recurrent network), and systematically investigate the effects of different components on enhancement performance, such as window sizes, loss functions, and feature representations. We conduct evaluation experiments in two main conditions: reverberant-only and reverberant-noisy. Our findings suggest that incorporating larger window sizes is helpful for dereverberation, and adding transform operations (either convolutional or linear) to encode and decode waveform features improves the sparsity of the learned representations, and boosts the performance of time-domain models. Experimental results demonstrate that ARN and DC-CRN with proposed techniques achieve superior performance compared with other strong enhancement baselines.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101677"},"PeriodicalIF":3.1000,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000603/pdfft?md5=6f57ae0077f304562bdf74000559d71d&pid=1-s2.0-S0885230824000603-main.pdf","citationCount":"0","resultStr":"{\"title\":\"A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments\",\"authors\":\"Heming Wang , Ashutosh Pandey , DeLiang Wang\",\"doi\":\"10.1016/j.csl.2024.101677\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Deep learning has led to dramatic performance improvements for the task of speech enhancement, where deep neural networks (DNNs) are trained to recover clean speech from noisy and reverberant mixtures. Most of the existing DNN-based algorithms operate in the frequency domain, as time-domain approaches are believed to be less effective for speech dereverberation. In this study, we employ two DNNs: ARN (attentive recurrent network) and DC-CRN (densely-connected convolutional recurrent network), and systematically investigate the effects of different components on enhancement performance, such as window sizes, loss functions, and feature representations. We conduct evaluation experiments in two main conditions: reverberant-only and reverberant-noisy. Our findings suggest that incorporating larger window sizes is helpful for dereverberation, and adding transform operations (either convolutional or linear) to encode and decode waveform features improves the sparsity of the learned representations, and boosts the performance of time-domain models. Experimental results demonstrate that ARN and DC-CRN with proposed techniques achieve superior performance compared with other strong enhancement baselines.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"89 \",\"pages\":\"Article 101677\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000603/pdfft?md5=6f57ae0077f304562bdf74000559d71d&pid=1-s2.0-S0885230824000603-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000603\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000603","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

深度学习极大地提高了语音增强任务的性能，通过对深度神经网络（DNN）进行训练，可以从噪声和混响混合物中恢复干净的语音。现有的基于 DNN 的算法大多在频域运行，而时域方法被认为对语音消除混响效果较差。在本研究中，我们采用了两种 DNN：ARN（殷勤递归网络）和 DC-CRN（密集连接卷积递归网络），并系统地研究了不同组件对增强性能的影响，如窗口大小、损失函数和特征表示。我们在两种主要条件下进行了评估实验：纯混响和混响噪声。我们的研究结果表明，采用更大的窗口尺寸有助于消除混响，而增加变换操作（卷积或线性）来编码和解码波形特征，则能改善所学表征的稀疏性，并提高时域模型的性能。实验结果表明，与其他强增强基线相比，采用了建议技术的 ARN 和 DC-CRN 性能更优。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments

Deep learning has led to dramatic performance improvements for the task of speech enhancement, where deep neural networks (DNNs) are trained to recover clean speech from noisy and reverberant mixtures. Most of the existing DNN-based algorithms operate in the frequency domain, as time-domain approaches are believed to be less effective for speech dereverberation. In this study, we employ two DNNs: ARN (attentive recurrent network) and DC-CRN (densely-connected convolutional recurrent network), and systematically investigate the effects of different components on enhancement performance, such as window sizes, loss functions, and feature representations. We conduct evaluation experiments in two main conditions: reverberant-only and reverberant-noisy. Our findings suggest that incorporating larger window sizes is helpful for dereverberation, and adding transform operations (either convolutional or linear) to encode and decode waveform features improves the sparsity of the learned representations, and boosts the performance of time-domain models. Experimental results demonstrate that ARN and DC-CRN with proposed techniques achieve superior performance compared with other strong enhancement baselines.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.