Authors: A. Prathipati, A.S.N. Chakravarthy
Journal: Engineering Research Express
Published: 2024-07-02
DOI: 10.1088/2631-8695/ad5e36 (https://doi.org/10.1088/2631-8695/ad5e36)
Single channel speech enhancement using time-frequency attention mechanism based nested U-net model
Deep-learning models have used attention mechanisms to improve the quality and intelligibility of noisy speech, and their results demonstrate that attention is effective. However, existing models rely on either spatial or temporal attention alone, which causes severe information loss. In this paper, a time-frequency attention mechanism with a nested U-network (TFANUNet) is proposed for single-channel speech enhancement. Using time-frequency attention (TFA), the model learns which channel, frequency, and time information is most significant for speech enhancement. The proposed model is an encoder-decoder in which each encoder and decoder layer is followed by a nested dense residual dilated DenseNet (NDRD) block for multi-scale context aggregation. NDRD applies multiple dilated convolutions with different dilation factors to cover large receptive fields at several scales simultaneously, and it avoids the aliasing problem in DenseNet. We integrate the TFA and NDRD blocks into the proposed model to enable refined feature extraction without information loss and utterance-level context aggregation, respectively. The proposed TFANUNet outperforms the baselines in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
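To illustrate why stacking dilated convolutions with different dilation factors widens the receptive field, the following is a minimal sketch in plain Python. It is not the authors' NDRD block: the kernel size, the dilation schedule `[1, 2, 4, 8]`, and the function names `dilated_conv1d` and `receptive_field` are assumptions chosen for illustration, since the abstract does not state the actual factors. For stride-1 layers with kernel size k and dilations d_i, the receptive field is 1 + sum((k - 1) * d_i).

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation, no padding, stride 1)."""
    span = (len(kernel) - 1) * dilation  # distance between first and last kernel tap
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical dilation schedule; the paper's actual factors are not given here.
dilations = [1, 2, 4, 8]
kernel = [1.0, 1.0, 1.0]

rf_dilated = receptive_field(len(kernel), dilations)      # four dilated layers
rf_plain = receptive_field(len(kernel), [1, 1, 1, 1])     # four ordinary layers

# Empirical check: an input exactly as long as the receptive field
# collapses to a single output sample after the whole stack.
x = [1.0] * rf_dilated
for d in dilations:
    x = dilated_conv1d(x, kernel, d)

print(rf_dilated, rf_plain, len(x))
```

With the same number of layers and parameters, the dilated stack in this sketch covers a much longer stretch of input than the undilated one, which is the property the abstract attributes to multi-scale dilated convolution.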