Stacked U-Net with Time–Frequency Attention and Deep Connection Net for Single Channel Speech Enhancement

Authors: Veeraswamy Parisae, S. Nagakishore Bhavanam
DOI: 10.1142/s0219467825500676
Journal: International Journal of Image and Graphics (Impact Factor 0.8, JCR Q4: Computer Science, Software Engineering)
Publication date: 2024-04-09 (Journal Article)
Citations: 0
Abstract
Deep neural networks have substantially advanced speech enhancement technology. However, many speech enhancement approaches cannot fully exploit context information across multiple scales, which limits their performance. To address this, we introduce TFADCSU-Net (Stacked U-Net with Time–Frequency Attention (TFA) and Deep Connection Layer (DCL)) for enhancing noisy speech in the time–frequency domain. TFADCSU-Net adopts an encoder-decoder structure with skip connections. Within TFADCSU-Net, a multiscale feature extraction layer (MSFEL) is proposed to effectively capture contextual information at multiple scales, allowing the network to exploit both global and local speech features when reconstructing the speech signal. Moreover, we incorporate a deep connection layer and a TFA mechanism into the network to further improve feature extraction and aggregate utterance-level context. The deep connection layer captures rich and precise features by connecting the initial layer directly to every subsequent layer, rather than relying only on connections between adjacent layers. This strengthens information flow through the network while avoiding a sharp rise in computational complexity as the number of layers grows. The TFA module consists of two attention branches operating in parallel: one along the temporal dimension and the other along the frequency dimension. The branches produce distinct forms of attention: one identifies relevant time frames, and the other selects frequency-wise channels. These attention mechanisms help the model discern "where" and "what" to prioritize. The temporal attention (TA) and frequency attention (FA) branches are then combined into a comprehensive two-dimensional attention map. This map assigns an attention weight to each spectral component of the time–frequency representation, enabling the network to effectively capture speech characteristics in the T-F representation. The results confirm that the proposed method outperforms competing models in both objective speech quality and intelligibility.
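The abstract describes the TFA module only at a high level, so the following is a minimal NumPy sketch of the two-branch idea, not the authors' implementation. The mean-pooling choices and the scalar gains `w_t` and `w_f` are placeholder assumptions standing in for the learned attention sub-networks; only the overall shape of the computation (two 1-D branches fused into one 2-D T-F attention map) follows the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tfa(spec, w_t=1.0, w_f=1.0):
    """Time-frequency attention sketch for a (T, F) magnitude spectrogram."""
    # Temporal branch: pool across frequency, score each time frame ("where")
    t_att = sigmoid(w_t * spec.mean(axis=1))   # shape (T,)
    # Frequency branch: pool across time, score each frequency bin ("what")
    f_att = sigmoid(w_f * spec.mean(axis=0))   # shape (F,)
    # Fuse the two 1-D branches into a comprehensive 2-D attention map
    att_map = np.outer(t_att, f_att)           # shape (T, F)
    # Reweight every spectral component of the T-F representation
    return spec * att_map

rng = np.random.default_rng(0)
spec = rng.random((100, 257))   # e.g. 100 time frames x 257 frequency bins
enhanced = tfa(spec)
print(enhanced.shape)           # prints (100, 257)
```

In the actual network each branch would be a small learned sub-module operating on convolutional feature maps rather than a fixed pooling plus sigmoid; the outer-product fusion is one simple way to realize the combination of the TA and FA branches into a single map.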