Xiaotong Tu, Jiaxin Xie, Yijin Mao, Yue Huang, Xinghao Ding, Shaogan Ye
Title: Multi-information-aware speech enhancement through self-supervised learning
Journal: Digital Signal Processing, vol. 168, Article 105464 (JCR Q2, Engineering, Electrical & Electronic; impact factor 2.9)
Published: 2025-07-15
DOI: 10.1016/j.dsp.2025.105464
URL: https://www.sciencedirect.com/science/article/pii/S1051200425004865
Citation count: 0
Abstract
Speech enhancement is a crucial technology aimed at improving the quality and intelligibility of speech signals in noisy environments. Recent advancements in deep neural networks have leveraged abundant clean speech datasets for supervised learning with remarkable results. However, supervised models suffer from poor robustness and generalization due to the scarcity of clean speech data and the complexity of the noise distribution in the real world. In this paper, a self-supervised speech enhancement model, called Multi-Information-Aware Speech Enhancement (MIA-SE), is proposed to address these challenges. A novel self-supervised training strategy is introduced in which denoising is performed on a single input twice, with the first denoiser output being employed as an Implicit Deep Denoiser Prior (IDDP) to supervise the subsequent denoising process. Furthermore, an encoder–decoder denoiser architecture based on a complex ratio masking strategy is incorporated to extract phase and magnitude features simultaneously. To capture sequence context information for improved embedding, transformer modules with multi-head attention mechanisms are integrated within the denoiser. The training process is guided by a newly formulated loss function to ensure successful and effective learning. Experimental results on synthetic and real-world noise databases demonstrate the effectiveness of MIA-SE, particularly in scenarios where paired training data is unavailable.
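The complex ratio masking idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the mask formulation used in MIA-SE, so the code below assumes the commonly used oracle definition M = S / Y on individual STFT bins, computed here from a known clean signal purely for illustration. In a self-supervised setting such as the one described, the mask would instead be predicted by the denoiser without access to clean speech. The function names and the toy bin values are hypothetical.

```python
import cmath


def ideal_complex_ratio_mask(clean_bins, noisy_bins, eps=1e-8):
    """Oracle complex mask M = S / Y per STFT bin (illustrative assumption).

    Computed as S * conj(Y) / (|Y|^2 + eps) to avoid division by zero.
    """
    return [s * y.conjugate() / (abs(y) ** 2 + eps)
            for s, y in zip(clean_bins, noisy_bins)]


def apply_complex_ratio_mask(mask_bins, noisy_bins):
    """Complex multiplication scales the magnitude and rotates the phase."""
    if len(mask_bins) != len(noisy_bins):
        raise ValueError("mask_bins and noisy_bins must have the same length")
    return [m * y for m, y in zip(mask_bins, noisy_bins)]


# Toy STFT bins (hypothetical values): noisy = clean + noise.
clean = [1 + 1j, 0.5 - 2j]
noise = [0.3 - 0.2j, 0.2 + 0.4j]
noisy = [s + n for s, n in zip(clean, noise)]

mask = ideal_complex_ratio_mask(clean, noisy)
enhanced = apply_complex_ratio_mask(mask, noisy)

for m, x, s in zip(mask, enhanced, clean):
    # abs(m) is the magnitude gain, cmath.phase(m) the phase correction.
    print(f"gain={abs(m):.3f}  phase_shift={cmath.phase(m):+.3f} rad  "
          f"error={abs(x - s):.2e}")
```

Because the mask is complex-valued, a single multiplication corrects both magnitude and phase, which a real-valued magnitude mask cannot do.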
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy