Title: Real-time audio enhancement framework for vocal performances based on LSTM and time-frequency masking algorithm
Author: Zan Huang
Journal: Computer Speech and Language, vol. 96, Article 101871 (impact factor 3.4; JCR Q2, Computer Science, Artificial Intelligence)
DOI: 10.1016/j.csl.2025.101871
Publication date: 2025-08-11
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000968
Citation count: 0
Abstract
This study proposes a framework for real-time enhancement of vocal performances based on a long short-term memory (LSTM) network and a time-frequency masking algorithm. The framework primarily addresses the conflict between suppressing non-stationary noise and preserving audio fidelity in complex acoustic scenes. The study makes three key contributions. (1) A real-time enhancement model combining LSTM and ideal ratio masking (IRM). An LSTM models long-term dependencies in time-frequency features and is combined with an IRM algorithm that dynamically adjusts noise weights. This fusion improves the clarity and intelligibility of audio signals against complex backgrounds: experiments show that, for signal-to-noise ratios from -10 to 5 dB, the model's PESQ and STOI scores reach 3.75 and 0.893, respectively. (2) An adaptive time-frequency masking algorithm. The study proposes an adaptive masking mechanism whose weights vary dynamically with the signal-to-noise ratio, balancing the trade-off between ideal binary masking (IBM) and IRM, and between distortion and noise suppression. (3) Masking coefficient optimization driven by a deep neural network. The study presents a bidirectional LSTM (BiLSTM) time-frequency processing module (TFPM) that hierarchically models intra-frame and inter-frame features. In addition, a composite LSTM ratio masking (LSTM-RM) objective function is introduced to enhance the amplitude and phase spectra simultaneously. Trained end to end, the proposed framework meets the real-time requirement and delivers stable enhancement on test sets covering ten noise types. The study provides a scalable algorithmic paradigm for real-time audio enhancement.
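The abstract names the masks but does not give formulas, so the following is only a minimal sketch of how an ideal ratio mask, an ideal binary mask, and an SNR-weighted blend of the two could be computed per time-frequency bin. The IRM and IBM definitions are the standard textbook ones. The blending rule in `adaptive_mask` (a sigmoid weight on local SNR), together with the parameters `center_db`, `slope`, and `beta` and all function names, is a hypothetical illustration and not the paper's method. In the paper the mask is estimated by the LSTM from noisy input; here it is computed from known speech and noise power purely to show the arithmetic.

```python
import math

EPS = 1e-12  # guards against division by zero and log(0)

def local_snr_db(s, n):
    """Local SNR of one time-frequency bin, in dB (s, n are powers)."""
    return 10.0 * math.log10((s + EPS) / (n + EPS))

def irm(s, n, beta=0.5):
    """Ideal ratio mask for one bin: (S / (S + N)) ** beta, in [0, 1]."""
    return (s / (s + n + EPS)) ** beta

def ibm(s, n, threshold_db=0.0):
    """Ideal binary mask for one bin: 1 if local SNR exceeds the threshold."""
    return 1.0 if local_snr_db(s, n) > threshold_db else 0.0

def adaptive_mask(s, n, center_db=0.0, slope=0.5):
    """Hypothetical SNR-weighted blend of IRM and IBM (an assumption,
    not the formula used in the paper)."""
    z = slope * (local_snr_db(s, n) - center_db)
    z = max(-60.0, min(60.0, z))      # keep exp() well inside float range
    w = 1.0 / (1.0 + math.exp(-z))    # w -> 1 at high SNR, w -> 0 at low SNR
    return w * irm(s, n) + (1.0 - w) * ibm(s, n)

def mask_grid(fn, speech_pow, noise_pow):
    """Apply a per-bin mask function over a [frame][frequency] grid."""
    return [[fn(s, n) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_pow, noise_pow)]

# Toy 2-frame x 2-bin power spectrograms (made-up values).
speech_pow = [[4.0, 0.5], [0.0, 9.0]]
noise_pow = [[1.0, 1.0], [1.0, 1.0]]
for name, fn in (("IRM", irm), ("IBM", ibm), ("adaptive", adaptive_mask)):
    print(name, mask_grid(fn, speech_pow, noise_pow))
```

Under this assumed rule, bins with high local SNR follow the soft IRM, which limits distortion of the voice, while bins with low local SNR move toward the hard IBM decision, which suppresses noise more aggressively. Plain lists keep the sketch dependency-free; a real-time system would apply the same arithmetic to array-valued spectrogram frames.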
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.