Authors: Hao Zhou; Yi Zhou; Zhenhua Cheng; Yu Zhao; Yin Liu
DOI: 10.1109/LSP.2025.3558690
Journal: IEEE Signal Processing Letters, vol. 32, pp. 1670-1674
Publication date: 2025-04-07 (Journal Article)
Impact factor: 3.2; JCR: Q2 (Engineering, Electrical & Electronic)
URL: https://ieeexplore.ieee.org/document/10955229/
Improved Encoder-Decoder Architecture With Human-Like Perception Attention for Monaural Speech Enhancement
Abstract: Speech enhancement (SE) models based on deep neural networks (DNNs) have shown excellent denoising performance. However, mainstream SE models often have high structural complexity and large parameter counts, requiring substantial computational resources, which limits their practical application. In this paper, a high-efficiency encoder-decoder structure named the human-like perception attention network (HPANet), inspired by the top-down attention mechanism of human brain perception, is proposed for monaural speech enhancement; it emulates the brain's perceptual attention in noisy environments. In HPANet, the raw waveform is first encoded by an attention encoder to capture shallow global features. These features are then downsampled, and multi-scale information is aggregated through a top attention module to prevent the loss of crucial information. Next, a down attention module integrates features from neighboring layers to reconstruct the signal in a top-down manner. Finally, the decoder reconstructs the denoised clean signal. Experiments show that the proposed method effectively reduces model complexity while maintaining competitive performance.
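The abstract gives no implementation details, so the following is only a loose toy illustration of the top-down multi-scale pattern it describes (downsample features, aggregate at the coarsest level, then integrate neighboring-layer features while reconstructing). Every function and parameter name here is hypothetical and not taken from HPANet:

```python
import numpy as np

def downsample(x):
    """Halve temporal resolution by averaging adjacent frames."""
    return x.reshape(-1, 2).mean(axis=1)

def upsample(x):
    """Double temporal resolution by repeating each frame."""
    return np.repeat(x, 2)

def top_down_fuse(signal, levels=3):
    """Toy top-down multi-scale fusion: build a pyramid of
    progressively downsampled features, then reconstruct from the
    coarsest ("top") level, merging each finer neighboring level
    on the way back down."""
    pyramid = [signal]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    out = pyramid[-1]  # coarsest representation
    for finer in reversed(pyramid[:-1]):
        # integrate features from the neighboring (finer) layer
        out = 0.5 * (upsample(out) + finer)
    return out

x = np.linspace(0.0, 1.0, 8)   # stand-in for a short feature sequence
y = top_down_fuse(x, levels=3)
print(y.shape)                  # output keeps the input resolution
```

The sketch only mimics the data flow (encode at full resolution, coarsen, then fuse top-down); the actual paper uses learned attention modules at each stage rather than fixed averaging.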
Journal introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, as well as at several workshops organized by the Signal Processing Society.