Real-time audio enhancement framework for vocal performances based on LSTM and time-frequency masking algorithm

IF 3.4 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2025-08-11 DOI:10.1016/j.csl.2025.101871

Zan Huang

Citations: 0

Abstract

This study proposes a framework for real-time enhancement of vocal performances based on a long short-term memory (LSTM) network and a time-frequency masking algorithm. The framework primarily addresses the tension between suppressing non-stationary noise and preserving audio fidelity in complex acoustic scenes. The key innovations of this study are:

1. A real-time enhancement model combining LSTM and ideal ratio masking (IRM). The study uses an LSTM to model long-term dependencies in time-frequency features, combining it with an IRM algorithm that dynamically adjusts noise weights. This fusion improves the clarity and intelligibility of audio signals in complex backgrounds. Experiments show that, within a signal-to-noise ratio (SNR) range of -10 to 5 dB, the model's PESQ and STOI scores improve to 3.75 and 0.893, respectively.

2. An adaptive time-frequency masking algorithm. The study proposes an adaptive masking mechanism based on a dynamic weight derived from the SNR, addressing the trade-off between independent binary masking and IRM, as well as between distortion and noise suppression.

3. Masking coefficient optimization driven by a deep neural network. The study presents a bidirectional LSTM time-frequency processing module (TFPM) that hierarchically models intra-frame and inter-frame features. In addition, a composite LSTM ratio masking (LSTM-RM) objective function is introduced to enhance the amplitude and phase spectra simultaneously.

Through end-to-end training, the proposed framework meets the real-time requirement and demonstrates stable enhancement on test sets covering ten types of noise. The study provides a scalable algorithmic paradigm for real-time audio enhancement.
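The masking ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the weighting formula, so the sigmoid weight on frame-level SNR, the `slope` parameter, the direction of the blend, and all function names are assumptions made for illustration. The IRM and binary mask use their common textbook definitions on magnitude spectrograms of shape (frequency, time).

```python
import numpy as np


def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Common IRM definition: sqrt(S^2 / (S^2 + N^2)) per time-frequency bin."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))


def ideal_binary_mask(speech_mag, noise_mag, threshold_db=0.0, eps=1e-8):
    """Binary mask: 1 where the local SNR exceeds threshold_db, else 0."""
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > threshold_db).astype(np.float64)


def adaptive_mask(speech_mag, noise_mag, slope=0.5, eps=1e-8):
    """Hypothetical SNR-weighted blend of the ratio and binary masks.

    Assumption (not from the paper): a per-frame weight alpha is a sigmoid
    of the frame-level SNR in dB. High-SNR frames lean on the soft ratio
    mask (less distortion); low-SNR frames lean on the binary mask
    (stronger suppression).
    """
    irm = ideal_ratio_mask(speech_mag, noise_mag, eps)
    ibm = ideal_binary_mask(speech_mag, noise_mag, eps=eps)
    # Frame-level SNR: energy summed over the frequency axis (axis 0).
    speech_energy = np.sum(speech_mag ** 2, axis=0, keepdims=True)
    noise_energy = np.sum(noise_mag ** 2, axis=0, keepdims=True)
    frame_snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    alpha = 1.0 / (1.0 + np.exp(-slope * frame_snr_db))
    return alpha * irm + (1.0 - alpha) * ibm


# Usage: apply the mask to a noisy magnitude spectrogram (toy random data).
rng = np.random.default_rng(0)
speech = np.abs(rng.normal(size=(257, 10)))
noise = np.abs(rng.normal(size=(257, 10)))
# Adding magnitudes is a simplification; real mixtures add complex spectra.
enhanced = adaptive_mask(speech, noise) * (speech + noise)
```

In a trained system the clean speech and noise magnitudes are unavailable at inference time, so a network such as the LSTM described above would be trained to predict the mask from noisy features; the oracle masks here only serve as training targets or upper-bound references.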

Source journal

Computer Speech and Language (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time

22.9 weeks

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.