Improved Encoder-Decoder Architecture With Human-Like Perception Attention for Monaural Speech Enhancement

IF 3.2 · JCR Region 2 (Engineering) · Q2 ENGINEERING, ELECTRICAL & ELECTRONIC
Hao Zhou;Yi Zhou;Zhenhua Cheng;Yu Zhao;Yin Liu
DOI: 10.1109/LSP.2025.3558690
Journal: IEEE Signal Processing Letters, vol. 32, pp. 1670-1674
Published: 2025-04-07 (Journal Article)
Full text: https://ieeexplore.ieee.org/document/10955229/
Citations: 0

Abstract

Speech enhancement (SE) models based on deep neural networks (DNNs) have shown excellent denoising performance. However, mainstream SE models often have high structural complexity and large parameter counts, requiring substantial computational resources, which limits their practical application. In this paper, a high-efficiency encoder-decoder structure, inspired by the top-down attention mechanism in human brain perception and named the human-like perception attention network (HPANet), is proposed for monaural speech enhancement; it is able to emulate brain perceptual attention in noisy environments. In HPANet, the raw waveform is first encoded by an attention encoder to capture shallow global features. These features are then downsampled, and multi-scale information is aggregated through a top attention module to prevent the loss of crucial information. Next, a down attention module integrates features from neighboring layers to reconstruct the signal in a top-down manner. Finally, the decoder reconstructs the denoised clean signal. Experiments show that the proposed method effectively reduces model complexity while maintaining competitive performance.
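The abstract's pipeline (encode → downsample into multiple scales → fuse top-down → decode) can be sketched at the shape level. Everything below is a placeholder illustration, not the paper's method: the internals of the attention encoder, top attention module, and down attention module are not specified in the abstract, so simple linear operations stand in for them, and all names and dimensions are assumptions made for the sketch.

```python
import numpy as np

# Shape-level sketch of the HPANet-style pipeline described in the
# abstract. All module bodies are placeholders (linear projections and
# averaging); only the data flow mirrors the abstract's description.

rng = np.random.default_rng(0)

def encode(x, dim=8):
    # placeholder "attention encoder": frame the waveform, project frames
    frames = x.reshape(-1, 16)                  # (num_frames, 16)
    w = rng.standard_normal((16, dim)) / 4.0
    return frames @ w                           # shallow global features

def downsample(f):
    # halve the time axis by averaging adjacent frames
    return 0.5 * (f[0::2] + f[1::2])

def fuse_top_down(coarse, fine):
    # placeholder "down attention": upsample the coarser scale and
    # merge it with features from the neighboring (finer) layer
    up = np.repeat(coarse, 2, axis=0)
    return 0.5 * (up + fine)

def decode(f):
    # placeholder decoder: project features back to waveform samples
    w = rng.standard_normal((f.shape[1], 16)) / 4.0
    return (f @ w).reshape(-1)

x = rng.standard_normal(256)                    # noisy input waveform
f1 = encode(x)                                  # (16, 8) fine scale
f2 = downsample(f1)                             # (8, 8)  coarser scale
f3 = downsample(f2)                             # (4, 8)  coarsest scale
g2 = fuse_top_down(f3, f2)                      # top-down reconstruction
g1 = fuse_top_down(g2, f1)
y = decode(g1)                                  # denoised waveform estimate

assert y.shape == x.shape                       # output matches input length
```

The point of the sketch is the multi-scale structure: each downsampling step trades temporal resolution for context, and the top-down fusion reinjects coarse context into finer scales before decoding, which is the mechanism the abstract credits with preventing the loss of crucial information.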
Source Journal
IEEE Signal Processing Letters (Engineering: Electrical & Electronic)
CiteScore: 7.40
Self-citation rate: 12.80%
Articles per year: 339
Review time: 2.8 months
Journal description: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.