Multi-information-aware speech enhancement through self-supervised learning

IF 2.9 | Zone 3, Engineering & Technology | JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC

Digital Signal Processing Pub Date : 2025-07-15 DOI:10.1016/j.dsp.2025.105464

Xiaotong Tu, Jiaxin Xie, Yijin Mao, Yue Huang, Xinghao Ding, Shaogan Ye

Digital Signal Processing, vol. 168, Article 105464. Full text: https://www.sciencedirect.com/science/article/pii/S1051200425004865

Citations: 0

Abstract

Speech enhancement is a crucial technology aimed at improving the quality and intelligibility of speech signals in noisy environments. Recent advancements in deep neural networks have leveraged abundant clean speech datasets for supervised learning with remarkable results. However, supervised models suffer from poor robustness and generalization due to the scarcity of clean speech data and the complexity of the noise distribution in the real world. In this paper, a self-supervised speech enhancement model, called Multi-Information-Aware Speech Enhancement (MIA-SE), is proposed to address these challenges. A novel self-supervised training strategy is introduced in which denoising is performed on a single input twice, with the first denoiser output being employed as an Implicit Deep Denoiser Prior (IDDP) to supervise the subsequent denoising process. Furthermore, an encoder–decoder denoiser architecture based on a complex ratio masking strategy is incorporated to extract phase and magnitude features simultaneously. To capture sequence context information for improved embedding, transformer modules with multi-head attention mechanisms are integrated within the denoiser. The training process is guided by a newly formulated loss function to ensure successful and effective learning. Experimental results on synthetic and real-world noise databases demonstrate the effectiveness of MIA-SE, particularly in scenarios where paired training data is unavailable.
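The abstract names complex ratio masking as the way phase and magnitude are recovered together, and describes the first denoiser output serving as a prior for the second pass. The NumPy sketch below illustrates both ideas on a toy spectrogram. It is not the authors' implementation: the function names, the oracle mask, the toy data, and the mean-squared-error stand-in for the prior-supervised term are all assumptions made for illustration, since the abstract does not give the network or the loss.

```python
import numpy as np


def apply_complex_ratio_mask(noisy_spec, mask_real, mask_imag):
    """Apply a complex ratio mask: (M_r + j*M_i) * (Y_r + j*Y_i)."""
    y_r, y_i = noisy_spec.real, noisy_spec.imag
    est_real = mask_real * y_r - mask_imag * y_i
    est_imag = mask_real * y_i + mask_imag * y_r
    return est_real + 1j * est_imag


def ideal_complex_ratio_mask(clean_spec, noisy_spec, eps=1e-8):
    """Oracle mask M = S / Y, computed as S * conj(Y) / (|Y|^2 + eps)."""
    mask = clean_spec * np.conj(noisy_spec) / (np.abs(noisy_spec) ** 2 + eps)
    return mask.real, mask.imag


def prior_supervised_loss(second_pass, first_pass):
    """Hypothetical stand-in: MSE of the second output against the first
    output, which is copied and treated as a fixed target (the 'prior')."""
    target = np.array(first_pass, copy=True)
    return float(np.mean(np.abs(second_pass - target) ** 2))


# Toy complex spectrogram: (frequency bins, frames). Purely synthetic data.
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

# A complex mask can correct both magnitude and phase.
m_r, m_i = ideal_complex_ratio_mask(clean, noisy)
crm_estimate = apply_complex_ratio_mask(noisy, m_r, m_i)

# A magnitude-only mask rescales each bin but keeps the noisy phase.
mag_estimate = (np.abs(clean) / (np.abs(noisy) + 1e-8)) * noisy

crm_error = float(np.mean(np.abs(crm_estimate - clean) ** 2))
mag_error = float(np.mean(np.abs(mag_estimate - clean) ** 2))
```

With the oracle mask, `crm_error` is close to zero while `mag_error` is not, because the magnitude-only estimate inherits the phase of the noisy input. In a trained model the two mask components would be predicted by the denoiser rather than computed from clean speech, which a self-supervised setup does not have.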

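Source journal

Digital Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 5.30
Self-citation rate: 17.20%
Articles published: 435
Review time: 66 days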

About the journal: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy