{"title":"Ultra-Low Latency Speech Enhancement - A Comprehensive Study","authors":"Haibin Wu, Sebastian Braun","doi":"arxiv-2409.10358","DOIUrl":null,"url":null,"abstract":"Speech enhancement models should meet very low latency requirements typically\nsmaller than 5 ms for hearing assistive devices. While various low-latency\ntechniques have been proposed, comparing these methods in a controlled setup\nusing DNNs remains blank. Previous papers have variations in task, training\ndata, scripts, and evaluation settings, which make fair comparison impossible.\nMoreover, all methods are tested on small, simulated datasets, making it\ndifficult to fairly assess their performance in real-world conditions, which\ncould impact the reliability of scientific findings. To address these issues,\nwe comprehensively investigate various low-latency techniques using consistent\ntraining on large-scale data and evaluate with more relevant metrics on\nreal-world data. Specifically, we explore the effectiveness of asymmetric\nwindows, learnable windows, adaptive time domain filterbanks, and the\nfuture-frame prediction technique. Additionally, we examine whether increasing\nthe model size can compensate for the reduced window size, as well as the novel\nMamba architecture in low-latency environments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10358","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Speech enhancement models for hearing assistive devices must meet very low latency requirements, typically below 5 ms. While various low-latency techniques have been proposed, a controlled comparison of these methods using DNNs is still missing. Previous papers differ in task, training data, scripts, and evaluation settings, making fair comparison impossible. Moreover, these methods have so far been tested only on small, simulated datasets, making it difficult to assess their performance in real-world conditions, which could undermine the reliability of scientific findings. To address these issues, we comprehensively investigate various low-latency techniques, training consistently on large-scale data and evaluating with more relevant metrics on real-world data. Specifically, we study the effectiveness of asymmetric windows, learnable windows, adaptive time-domain filterbanks, and the future-frame prediction technique. Additionally, we examine whether increasing the model size can compensate for the reduced window size, and we evaluate the novel Mamba architecture in low-latency settings.
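To make the asymmetric-window idea concrete, below is a minimal numpy sketch of low-latency STFT processing with a long asymmetric analysis window and a short synthesis window. All window shapes, lengths, and the `enhance_frame` placeholder are illustrative assumptions, not the paper's configuration; the window pair is not designed for perfect reconstruction and only demonstrates the latency mechanism.

```python
# Illustrative sketch (assumed parameters, not the paper's setup):
# a long asymmetric analysis window gives good frequency resolution,
# while a short synthesis window applied to the tail of each frame
# keeps the algorithmic latency on the order of the synthesis window
# length rather than the analysis window length.
import numpy as np

FS = 16000     # sample rate (Hz)
N_ANA = 512    # 32 ms asymmetric analysis window
N_SYN = 64     # 4 ms synthesis window -> low output latency
HOP = 32       # 2 ms hop

# Asymmetric analysis window: long rising half, short falling half,
# concentrating the window's energy at the most recent frame end.
rise = np.hanning(2 * (N_ANA - N_SYN // 2))[: N_ANA - N_SYN // 2]
fall = np.hanning(N_SYN)[N_SYN // 2:]
w_ana = np.concatenate([rise, fall])

# Short synthesis window for the last N_SYN samples of each frame.
w_syn = np.hanning(N_SYN)

def enhance_frame(spectrum: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a DNN mask/filter; identity here."""
    return spectrum

def process(x: np.ndarray) -> np.ndarray:
    y = np.zeros(len(x) + N_ANA)
    for start in range(0, len(x) - N_ANA, HOP):
        frame = x[start : start + N_ANA] * w_ana
        spec = np.fft.rfft(frame)
        out = np.fft.irfft(enhance_frame(spec), n=N_ANA)
        # Overlap-add only the short windowed tail of the frame, so
        # each output sample is emitted ~N_SYN samples (4 ms) after it
        # is captured, despite the 32 ms analysis context.
        y[start + N_ANA - N_SYN : start + N_ANA] += out[-N_SYN:] * w_syn
    return y[: len(x)]

if __name__ == "__main__":
    noisy = np.random.randn(FS)  # 1 s of noise as a stand-in signal
    print(process(noisy).shape)
```

With these assumed parameters, the output delay is governed by the 4 ms synthesis window and 2 ms hop rather than the 32 ms analysis window, which is the trade-off the asymmetric-window technique exploits.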