Real-time audio enhancement framework for vocal performances based on LSTM and time-frequency masking algorithm

IF 3.4; Zone 3 (Computer Science); JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Zan Huang
DOI: 10.1016/j.csl.2025.101871
Journal: Computer Speech and Language, volume 96, Article 101871
Publication date: 2025-08-11
Publication type: Journal Article
Open access: No
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000968
Citation count: 0

Abstract

This study proposes a framework for real-time enhancement of vocal performances based on a long short-term memory (LSTM) network and a time-frequency masking algorithm. The framework primarily addresses the conflict between suppressing non-stationary noise and preserving audio fidelity in complex acoustic scenes. The key contributions of this study are:

1. A real-time enhancement model combining LSTM and ideal ratio masking (IRM). The study uses an LSTM to model long-term dependencies in time-frequency features and combines it with an IRM algorithm that dynamically adjusts noise weights. This combination improves the clarity and intelligibility of audio signals in complex backgrounds. Experiments show that, for signal-to-noise ratios from -10 to 5 dB, the model's PESQ and STOI scores reach 3.75 and 0.893, respectively.

2. An adaptive time-frequency masking algorithm. The study proposes an adaptive masking mechanism that weights the mask dynamically according to the signal-to-noise ratio, addressing the trade-off between independent binary masking and IRM, and between distortion and noise suppression.

3. Masking coefficient optimization driven by a deep neural network. The study presents a bidirectional LSTM time-frequency processing module (TFPM) that hierarchically models intra-frame and inter-frame features. In addition, a composite LSTM ratio masking (LSTM-RM) objective function is introduced to enhance the amplitude and phase spectra simultaneously.

Through end-to-end training, the proposed framework meets the real-time requirement and shows stable enhancement on test sets covering ten types of noise. The study provides a scalable algorithmic paradigm for real-time audio enhancement.
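The abstract does not give the masking formulas, so the following is only a minimal NumPy sketch of the general idea, not the paper's method. It uses the commonly cited IRM definition, (|S|^2 / (|S|^2 + |N|^2))^beta with beta = 0.5, and a simple binary mask thresholded on local SNR. The function `snr_weighted_mask`, its sigmoid weighting, and the `slope` and `lc_db` parameters are hypothetical illustrations of "weighting a mask by SNR"; the paper's actual adaptive mechanism may differ. In a real system the LSTM would predict the mask from the noisy spectrum, whereas here the masks are computed from known speech and noise magnitudes, as one would when building training targets.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero and log(0)

def local_snr_db(speech_mag, noise_mag):
    """Per-bin SNR in dB from magnitude spectrograms of shape (freq, time)."""
    return 10.0 * np.log10((speech_mag ** 2 + EPS) / (noise_mag ** 2 + EPS))

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Soft mask in [0, 1]: (|S|^2 / (|S|^2 + |N|^2)) ** beta."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + EPS)) ** beta

def binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """Hard mask: 1 where local SNR exceeds the threshold lc_db, else 0."""
    return (local_snr_db(speech_mag, noise_mag) > lc_db).astype(float)

def snr_weighted_mask(speech_mag, noise_mag, slope=0.5):
    """Hypothetical blend of the two masks (an assumption, not from the paper).

    The weight w is a sigmoid of the local SNR: high-SNR bins lean on the
    soft ratio mask, low-SNR bins lean on the hard binary mask.
    """
    w = 1.0 / (1.0 + np.exp(-slope * local_snr_db(speech_mag, noise_mag)))
    irm = ideal_ratio_mask(speech_mag, noise_mag)
    ibm = binary_mask(speech_mag, noise_mag)
    return w * irm + (1.0 - w) * ibm

# Toy data standing in for STFT magnitudes (257 frequency bins, 100 frames).
rng = np.random.default_rng(0)
speech = np.abs(rng.normal(size=(257, 100)))
noise = 0.5 * np.abs(rng.normal(size=(257, 100)))
# Assumes uncorrelated speech and noise, so powers add in each bin.
mixture = np.sqrt(speech ** 2 + noise ** 2)

mask = snr_weighted_mask(speech, noise)
enhanced = mask * mixture  # masked magnitude; phase handling is omitted here
print("mean mask value:", float(mask.mean()))
```

Because the blend is a convex combination of two masks that both lie in [0, 1], the result also stays in [0, 1], so applying it can only attenuate a bin and never amplify it.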
Source journal
Computer Speech and Language (Engineering and Technology; Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.