A ChannelWise weighting technique of slice-based Temporal Convolutional Network for noisy speech enhancement

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2023-10-11 DOI:10.1016/j.csl.2023.101572

Wei-Tyng Hong, Kuldeep Singh Rana


Citations: 0

Abstract

In recent years, Temporal Convolutional Networks (TCNs) have driven significant progress in single-channel noisy speech enhancement. However, TCN-based systems still face certain challenges, such as limited utilization of network channel depth for handling long-range dependencies and issues with weight sharing. To address these challenges, this paper proposes a novel channel-wise weighting scheme, specifically designed for the sliced TCN framework. The proposed scheme involves the element-wise multiplication of shifting weight techniques for each channel of the TCN slice. Utilizing a cyclically shifted approach, these weights capture information from neighboring channels, uncovering the dependencies between adjacent channels. By combining the channel-wise weighted TCN output and subsequently estimating a masking function, the proposed method effectively suppresses noise components, leading to enhanced speech quality. To train and evaluate our proposed method, we utilize speech datasets that consist of various noise types at different levels. To optimize the performance of the proposed end-to-end enhancement system, we adopt the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) objective function. Experimental results demonstrate the effectiveness of our proposed TCN channel-wise weighting method, with a significant average improvement of approximately 9.8% in SI-SNR for the unseen noise dataset. This improvement was observed at an SNR of $-3$ dB for both non-channel-wise weighting schemes and the proposed channel-wise weighting schemes within the Multi-slicing TCNs framework. The main advantage of the proposed approach is its ability to address the challenges of uneven and biased output from TCN slices, particularly when dealing with highly non-stationary, noisy speech signals infused with speech-like noise. This leads to more robust performance in various real-world applications.
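The abstract names SI-SNR as the training objective but does not spell it out. As a rough illustration, the sketch below computes SI-SNR using the definition commonly found in the speech-enhancement literature: both signals are made zero-mean, the estimate is projected onto the clean target, and the energy of that projection is compared with the energy of the residual. The function name `si_snr`, the `eps` stabilizer, and the pure-Python list representation are assumptions made for this example; the paper's exact implementation may differ.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (common textbook form; an assumption here).

    estimate, target: equal-length sequences of float samples.
    eps: small constant guarding against division by zero and log(0).
    """
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("estimate and target must be non-empty and equal length")

    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    est = [x - mean_e for x in estimate]
    tgt = [x - mean_t for x in target]

    # Project the estimate onto the target: s_target = (<est, tgt> / ||tgt||^2) * tgt
    dot_et = sum(e * t for e, t in zip(est, tgt))
    energy_t = sum(t * t for t in tgt)
    scale = dot_et / (energy_t + eps)
    s_target = [scale * t for t in tgt]

    # Whatever the projection does not explain is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    energy_s = sum(s * s for s in s_target)
    energy_n = sum(v * v for v in e_noise)
    return 10.0 * math.log10(energy_s / (energy_n + eps) + eps)
```

Because the estimate is projected onto the target before energies are compared, rescaling the estimate leaves the score essentially unchanged, which is what makes the measure scale-invariant. When used as a training loss, the negative of this value is typically minimized.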




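Source journal

Computer Speech and Language (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time

22.9 weeks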

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.