{"title":"Stacked U-Net with Time–Frequency Attention and Deep Connection Net for Single Channel Speech Enhancement","authors":"Veeraswamy Parisae, S. Nagakishore Bhavanam","doi":"10.1142/s0219467825500676","DOIUrl":null,"url":null,"abstract":"Deep neural networks have significantly promoted the progress of speech enhancement technology. However, a great number of speech enhancement approaches are unable to fully utilize context information from various scales, hindering performance enhancement. To tackle this issue, we introduce a method called TFADCSU-Net (Stacked U-Net with Time-Frequency Attention (TFA) and Deep Connection Layer (DCL)) for enhancing noisy speech in the time–frequency domain. TFADCSU-Net adopts an encoder-decoder structure with skip links. Within TFADCSU-Net, a multiscale feature extraction layer (MSFEL) is proposed to effectively capture contextual data from various scales. This allows us to leverage both global and local speech features to enhance the reconstruction of speech signals. Moreover, we incorporate deep connection layer and TFA mechanisms into the network to further improve feature extraction and aggregate utterance level context. The deep connection layer effectively captures rich and precise features by establishing direct connections starting from the initial layer to all subsequent layers, rather than relying on connections from earlier layers to subsequent layers. This approach not only enhances the information flow within the network but also avoids a significant rise in computational complexity as the number of network layers increases. The TFA module consists of two attention branches operating concurrently: one directed towards the temporal dimension and the other towards the frequency dimension. These branches generate distinct forms of attention — one for identifying relevant time frames and another for selecting frequency wise channels. 
These attention mechanisms assist the models in discerning “where” and “what” to prioritize. Subsequently, the TA and FA branches are combined to produce a comprehensive attention map in two dimensions. This map assigns specific attention weights to individual spectral components in the time–frequency representation, enabling the networks to proficiently capture the speech characteristics in the T-F representation. The results confirm that the proposed method outperforms other models in terms of objective speech quality as well as intelligibility.","PeriodicalId":44688,"journal":{"name":"International Journal of Image and Graphics","volume":null,"pages":null},"PeriodicalIF":0.8000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Image and Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s0219467825500676","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Deep neural networks have significantly advanced speech enhancement technology. However, many speech enhancement approaches cannot fully exploit contextual information across multiple scales, which limits their performance. To address this issue, we introduce TFADCSU-Net (Stacked U-Net with Time-Frequency Attention (TFA) and Deep Connection Layer (DCL)) for enhancing noisy speech in the time–frequency domain. TFADCSU-Net adopts an encoder-decoder structure with skip connections. Within TFADCSU-Net, a multiscale feature extraction layer (MSFEL) is proposed to effectively capture contextual information at multiple scales, allowing the network to exploit both global and local speech features when reconstructing the speech signal. Moreover, we incorporate the DCL and the TFA mechanism into the network to further improve feature extraction and to aggregate utterance-level context. The DCL captures rich and precise features by establishing direct connections from the initial layer to every subsequent layer, rather than relying only on connections between consecutive layers. This design not only improves information flow within the network but also avoids a significant rise in computational complexity as the network deepens. The TFA module consists of two attention branches operating in parallel: one along the temporal dimension and the other along the frequency dimension. These branches generate distinct forms of attention: temporal attention (TA) identifies the relevant time frames, while frequency attention (FA) selects the informative frequency-wise channels. These attention mechanisms help the model discern "where" and "what" to prioritize. The TA and FA branches are then combined to produce a comprehensive two-dimensional attention map.
This map assigns an attention weight to each spectral component of the time–frequency (T-F) representation, enabling the network to capture speech characteristics in the T-F domain effectively. The results confirm that the proposed method outperforms competing models in terms of both objective speech quality and intelligibility.
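The abstract does not give the exact formulation of the TFA module, so the following is only a minimal numpy sketch of the general idea it describes: each branch pools the spectrogram along one dimension, gates the result, and the two branches are combined into a single 2-D map that reweights every T-F component. The function name, the mean-pooling, the sigmoid gate, and the outer-product combination are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_frequency_attention(spec):
    """Reweight a magnitude spectrogram with a 2-D attention map.

    spec: (F, T) array, frequency bins x time frames.
    Returns an array of the same shape.
    """
    # Temporal-attention (TA) branch: pool over frequency to score each frame.
    ta = sigmoid(spec.mean(axis=0))  # shape (T,)
    # Frequency-attention (FA) branch: pool over time to score each bin.
    fa = sigmoid(spec.mean(axis=1))  # shape (F,)
    # Combine the branches into one 2-D map via an outer product, so every
    # T-F component gets its own attention weight.
    attention_map = np.outer(fa, ta)  # shape (F, T)
    return spec * attention_map

# Toy usage: a 4-bin x 5-frame magnitude spectrogram.
spec = np.abs(np.random.default_rng(0).normal(size=(4, 5)))
enhanced = time_frequency_attention(spec)
assert enhanced.shape == spec.shape
```

In a real network the pooled statistics would typically pass through small learned layers before the gate; the sketch omits learning entirely and only shows how a per-frame and a per-bin score can be fused into one two-dimensional weighting.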