Real-Time Single Channel Speech Enhancement Using Triple Attention and Stacked Squeeze-TCN

IF 1.8 · CAS Region 4 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Chaitanya Jannu, Manaswini Burra, Sunny Dayal Vanambathina, Veeraswamy Parisae
Journal: Computational Intelligence, Vol. 41, No. 1
DOI: 10.1111/coin.70016
URL: https://onlinelibrary.wiley.com/doi/10.1111/coin.70016
Published: 2025-01-06 (Journal Article; not open access)
Citations: 0

Abstract

Speech enhancement is crucial in many speech processing applications. Recently, researchers have explored ways to improve performance by effectively capturing the long-term contextual relationships within speech signals. Multi-stage learning, in which several deep learning modules are applied one after another, has proven to be an effective approach. The attention mechanism has also been explored for improving speech quality, yielding significant gains. Attention modules are typically developed to improve the performance of CNN backbone networks; however, they often rely on fully connected (FC) and convolution layers, which increase the model's parameter count and computational cost. The present study applies multi-stage learning to speech enhancement. The proposed model uses a multi-stage structure in which, at each stage, a triple attention block (TAB) is followed by a sequence of squeeze temporal convolutional modules (STCMs) with doubled dilation rates. An estimate is generated at each stage and refined in the subsequent one. To reintroduce the original information, a feature fusion module (FFM) is inserted at the beginning of each following stage. By repeatedly unfolding STCMs, the intermediate output undergoes several rounds of step-by-step refinement, eventually yielding a precise estimate of the spectrum. The TAB is designed to improve model performance by allowing the network to concentrate simultaneously on regions of interest along the channel, spatial, and time-frequency dimensions. More specifically, the channel-spatial attention (CSA) component has two parallel branches combining channel attention with spatial attention, so that the channel and spatial dimensions are captured simultaneously. The signal is then emphasized as a function of time and frequency by aggregating the feature maps along these dimensions, which improves the model's ability to capture the temporal dependencies of speech signals. Evaluated on the VCTK and LibriSpeech datasets, the proposed speech enhancement system outperforms state-of-the-art deep learning techniques in terms of PESQ, STOI, CSIG, CBAK, and COVL.
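Two ideas from the abstract can be illustrated with a minimal NumPy sketch: (1) how stacking dilated temporal convolutions with doubled dilation rates grows the receptive field, and (2) how a time-frequency attention branch emphasizes the signal by aggregating feature maps along the time and frequency axes. This is a hypothetical, parameter-free illustration, not the paper's actual STCM/TAB modules, which contain learned convolution layers; all shapes and function names here are assumptions.

```python
import numpy as np

def receptive_field(kernel: int = 3, dilations=(1, 2, 4, 8, 16)) -> int:
    """Receptive field (in frames) of a stack of dilated 1-D convolutions
    with doubled dilation rates: 1 + sum_d d * (kernel - 1)."""
    return 1 + sum(d * (kernel - 1) for d in dilations)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_freq_attention(x):
    """Toy time-frequency attention on a feature map x of shape
    (channels, time, freq): pool (mean) along the frequency axis to get
    a per-time descriptor and along the time axis to get a per-frequency
    descriptor, turn each into gating weights with a sigmoid, and
    rescale the input. Illustration only - the TAB in the paper also has
    channel and spatial branches and learned layers."""
    t_desc = x.mean(axis=2, keepdims=True)  # (C, T, 1): pooled over freq
    f_desc = x.mean(axis=1, keepdims=True)  # (C, 1, F): pooled over time
    return x * sigmoid(t_desc) * sigmoid(f_desc)

# Five layers of kernel-3 convs with dilations 1,2,4,8,16 already span
# 63 frames of context, which is how such stacks capture long-term
# dependencies cheaply.
rf = receptive_field()            # 1 + 2*(1+2+4+8+16) = 63

x = np.random.randn(4, 10, 8)     # (channels, time, freq), made-up sizes
y = time_freq_attention(x)        # same shape, attention-rescaled
```

Because the gates are sigmoids in (0, 1), the attention only attenuates: each output entry keeps the sign of its input and never exceeds it in magnitude, which makes the rescaling easy to sanity-check.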

Source Journal

Computational Intelligence (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 6.90
Self-citation rate: 3.60%
Articles per year: 65
Review time: >12 weeks
Journal description: This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues, from the tools and languages of AI to its philosophical implications, Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.