FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
{"title":"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching","authors":"Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang","doi":"arxiv-2409.07614","DOIUrl":null,"url":null,"abstract":"Language-queried audio source separation (LASS) focuses on separating sounds\nusing textual descriptions of the desired sources. Current methods mainly use\ndiscriminative approaches, such as time-frequency masking, to separate target\nsounds and minimize interference from other sources. However, these models face\nchallenges when separating overlapping soundtracks, which may lead to artifacts\nsuch as spectral holes or incomplete separation. Rectified flow matching (RFM),\na generative model that establishes linear relations between the distribution\nof data and noise, offers superior theoretical properties and simplicity, but\nhas not yet been explored in sound separation. In this work, we introduce\nFlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns\nlinear flow trajectories from noise to target source features within the\nvariational autoencoder (VAE) latent space. During inference, the RFM-generated\nlatent features are reconstructed into a mel-spectrogram via the pre-trained\nVAE decoder, followed by a pre-trained vocoder to synthesize the waveform.\nTrained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art\nmodels across multiple benchmarks, as evaluated with subjective and objective\nmetrics. Additionally, our results show that FlowSep surpasses a\ndiffusion-based LASS model in both separation quality and inference efficiency,\nhighlighting its strong potential for audio source separation tasks. Code,\npre-trained models and demos can be found at:\nhttps://audio-agi.github.io/FlowSep_demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07614","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the distribution of data and noise, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space. During inference, the RFM-generated latent features are reconstructed into a mel-spectrogram via the pre-trained VAE decoder, followed by a pre-trained vocoder to synthesize the waveform. Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art models across multiple benchmarks, as evaluated with subjective and objective metrics. Additionally, our results show that FlowSep surpasses a diffusion-based LASS model in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks. Code, pre-trained models and demos can be found at: https://audio-agi.github.io/FlowSep_demo/.
FlowSep:通过整流匹配进行语言查询声音分离
语言查询音源分离(LASS)主要是通过对所需音源的文字描述来分离声音。目前的方法主要使用时间频率掩蔽等鉴别方法来分离目标声音,并尽量减少其他音源的干扰。然而,这些模型在分离重叠音轨时面临挑战,可能会导致频谱孔洞或分离不完全等假象。整流匹配(RFM)是一种在数据和噪声分布之间建立线性关系的生成模型,具有优越的理论特性和简便性,但尚未在声音分离中得到应用。在这项工作中,我们引入了基于 RFM 的新生成模型 FlowSep,用于 LASS 任务。FlowSep 在变异自动编码器(VAE)潜空间内学习从噪声到目标声源特征的线性流动轨迹。在推理过程中,RFM 生成的潜在特征会通过预训练的 VAE 解码器重构为 mel 光谱图,然后再通过预训练的声码器合成波形。此外,我们的结果表明,FlowSep 在分离质量和推理效率方面都超过了基于扩散的 LASS 模型,这凸显了它在音源分离任务中的强大潜力。有关代码、预训练模型和演示,请访问:https://audio-agi.github.io/FlowSep_demo/。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信