FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
arXiv:2409.07614 (Audio and Speech Processing), 11 September 2024
Abstract
Language-queried audio source separation (LASS) focuses on separating sounds
using textual descriptions of the desired sources. Current methods mainly use
discriminative approaches, such as time-frequency masking, to separate target
sounds and minimize interference from other sources. However, these models face
challenges when separating overlapping sound sources, which may lead to artifacts
such as spectral holes or incomplete separation. Rectified flow matching (RFM),
a generative framework that learns straight, linear transport paths between the
noise and data distributions, offers appealing theoretical properties and simplicity, but
has not yet been explored in sound separation. In this work, we introduce
FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns
linear flow trajectories from noise to target source features within the
variational autoencoder (VAE) latent space. During inference, the RFM-generated
latent features are reconstructed into a mel-spectrogram via the pre-trained
VAE decoder, followed by a pre-trained vocoder to synthesize the waveform.
Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art
models across multiple benchmarks, as evaluated with subjective and objective
metrics. Additionally, our results show that FlowSep surpasses a
diffusion-based LASS model in both separation quality and inference efficiency,
highlighting its strong potential for audio source separation tasks. Code,
pre-trained models and demos can be found at:
https://audio-agi.github.io/FlowSep_demo/.
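
As background for the rectified flow matching idea referenced in the abstract, a latent-space model of this kind is typically trained with a conditional flow-matching objective of roughly the following form. The notation is illustrative and not taken from the paper: z_0 is a Gaussian noise sample, z_1 the target-source VAE latent, c the text-query embedding, and v_theta the learned velocity field.

    z_t = (1 - t)\, z_0 + t\, z_1, \qquad z_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

    \mathcal{L}(\theta) = \mathbb{E}_{t,\, z_0,\, z_1,\, c} \left[ \left\| v_\theta(z_t, t, c) - (z_1 - z_0) \right\|_2^2 \right]

The target velocity z_1 - z_0 is constant along each straight interpolation path, which is the "linear flow trajectory" property the abstract refers to and the reason rectified flow can be sampled accurately with few integration steps.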
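The inference pipeline the abstract describes (flow sampling in the VAE latent space, VAE decoding to a mel-spectrogram, then vocoding) could look roughly like the sketch below. All function and argument names are hypothetical placeholders, not the released FlowSep API, and how the mixture and text query condition the velocity network is an assumption.

import torch

@torch.no_grad()
def flowsep_inference(mixture_mel, text_query, text_encoder, velocity_net,
                      vae_decoder, vocoder, latent_shape=(1, 8, 16, 256),
                      num_steps=25):
    # Encode the language query; the flow network is conditioned on this
    # embedding (and on the mixture) at every integration step.
    cond = text_encoder(text_query)

    # Rectified flow sampling: start from Gaussian noise in the VAE latent
    # space and integrate the learned velocity field from t = 0 to t = 1
    # with a plain Euler solver.
    z = torch.randn(latent_shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt)
        z = z + dt * velocity_net(z, t, cond, mixture_mel)

    mel = vae_decoder(z)      # latent -> mel-spectrogram (pre-trained VAE decoder)
    waveform = vocoder(mel)   # mel-spectrogram -> waveform (pre-trained vocoder)
    return waveform

Because the learned trajectories are close to straight lines, num_steps can stay small, which is consistent with the inference-efficiency advantage over the diffusion-based LASS baseline reported in the abstract.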