{"title":"FlowSep:通过整流匹配进行语言查询声音分离","authors":"Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang","doi":"arxiv-2409.07614","DOIUrl":null,"url":null,"abstract":"Language-queried audio source separation (LASS) focuses on separating sounds\nusing textual descriptions of the desired sources. Current methods mainly use\ndiscriminative approaches, such as time-frequency masking, to separate target\nsounds and minimize interference from other sources. However, these models face\nchallenges when separating overlapping soundtracks, which may lead to artifacts\nsuch as spectral holes or incomplete separation. Rectified flow matching (RFM),\na generative model that establishes linear relations between the distribution\nof data and noise, offers superior theoretical properties and simplicity, but\nhas not yet been explored in sound separation. In this work, we introduce\nFlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns\nlinear flow trajectories from noise to target source features within the\nvariational autoencoder (VAE) latent space. During inference, the RFM-generated\nlatent features are reconstructed into a mel-spectrogram via the pre-trained\nVAE decoder, followed by a pre-trained vocoder to synthesize the waveform.\nTrained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art\nmodels across multiple benchmarks, as evaluated with subjective and objective\nmetrics. Additionally, our results show that FlowSep surpasses a\ndiffusion-based LASS model in both separation quality and inference efficiency,\nhighlighting its strong potential for audio source separation tasks. Code,\npre-trained models and demos can be found at:\nhttps://audio-agi.github.io/FlowSep_demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching\",\"authors\":\"Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang\",\"doi\":\"arxiv-2409.07614\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Language-queried audio source separation (LASS) focuses on separating sounds\\nusing textual descriptions of the desired sources. Current methods mainly use\\ndiscriminative approaches, such as time-frequency masking, to separate target\\nsounds and minimize interference from other sources. However, these models face\\nchallenges when separating overlapping soundtracks, which may lead to artifacts\\nsuch as spectral holes or incomplete separation. Rectified flow matching (RFM),\\na generative model that establishes linear relations between the distribution\\nof data and noise, offers superior theoretical properties and simplicity, but\\nhas not yet been explored in sound separation. In this work, we introduce\\nFlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns\\nlinear flow trajectories from noise to target source features within the\\nvariational autoencoder (VAE) latent space. During inference, the RFM-generated\\nlatent features are reconstructed into a mel-spectrogram via the pre-trained\\nVAE decoder, followed by a pre-trained vocoder to synthesize the waveform.\\nTrained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art\\nmodels across multiple benchmarks, as evaluated with subjective and objective\\nmetrics. 
Additionally, our results show that FlowSep surpasses a\\ndiffusion-based LASS model in both separation quality and inference efficiency,\\nhighlighting its strong potential for audio source separation tasks. Code,\\npre-trained models and demos can be found at:\\nhttps://audio-agi.github.io/FlowSep_demo/.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07614\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07614","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
Language-queried audio source separation (LASS) focuses on separating sounds
using textual descriptions of the desired sources. Current methods mainly use
discriminative approaches, such as time-frequency masking, to separate target
sounds and minimize interference from other sources. However, these models face
challenges when separating overlapping soundtracks, which may lead to artifacts
such as spectral holes or incomplete separation. Rectified flow matching (RFM),
a generative model that establishes a linear relation between the data and noise
distributions, offers superior theoretical properties and simplicity, but
has not yet been explored in sound separation. In this work, we introduce
FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns
linear flow trajectories from noise to target source features within the
variational autoencoder (VAE) latent space. During inference, the RFM-generated
latent features are reconstructed into a mel-spectrogram via the pre-trained
VAE decoder, followed by a pre-trained vocoder to synthesize the waveform.
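As a concrete illustration of the training objective, the PyTorch sketch below
regresses the straight-line velocity between a noise sample and a target latent,
conditioned on a text-query embedding. The module, dimensions, and conditioning
interface are illustrative assumptions for exposition, not the paper's actual
implementation.

# A minimal rectified flow matching (RFM) training sketch over VAE latents.
# All names and shapes are hypothetical placeholders.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy stand-in for the conditional velocity estimator v_theta(z_t, t, c)."""
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, cond):
        # Concatenate the noisy latent, the scalar time, and the text-query embedding.
        return self.net(torch.cat([z_t, t, cond], dim=-1))

def rfm_loss(model, z1, cond):
    """Regress the constant velocity (z1 - z0) along the straight path
    z_t = (1 - t) * z0 + t * z1 from noise z0 to the target latent z1."""
    z0 = torch.randn_like(z1)              # noise sample
    t = torch.rand(z1.shape[0], 1)         # uniform time in [0, 1]
    z_t = (1 - t) * z0 + t * z1            # linear interpolation
    target_v = z1 - z0                     # straight-line velocity
    pred_v = model(z_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# Example usage with random stand-ins for VAE latents and text embeddings.
model = VelocityNet(latent_dim=64, cond_dim=32)
z1 = torch.randn(8, 64)                    # target-source latents from the VAE encoder
cond = torch.randn(8, 32)                  # language-query embeddings
loss = rfm_loss(model, z1, cond)
loss.backward()
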
Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art
models across multiple benchmarks, as evaluated with subjective and objective
metrics. Additionally, our results show that FlowSep surpasses a
diffusion-based LASS model in both separation quality and inference efficiency,
highlighting its strong potential for audio source separation tasks. Code,
pre-trained models and demos can be found at:
https://audio-agi.github.io/FlowSep_demo/.
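
For completeness, a minimal sketch of the inference loop described above: a
fixed-step Euler integration of the learned flow from noise to a latent, whose
output would then be passed to the pre-trained VAE decoder and vocoder. It reuses
the VelocityNet from the sketch above; vae_decoder and vocoder are hypothetical
placeholders for the pre-trained components, and the step count is illustrative.

import torch

@torch.no_grad()
def separate(model, cond, latent_dim=64, num_steps=25):
    """Euler integration of dz/dt = v_theta(z, t, cond) from t = 0 (noise) to t = 1."""
    z = torch.randn(cond.shape[0], latent_dim)    # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        z = z + dt * model(z, t, cond)            # follow the predicted velocity
    return z

# z_hat = separate(model, cond)
# mel = vae_decoder(z_hat)    # latent -> mel-spectrogram via the pre-trained VAE decoder
# waveform = vocoder(mel)     # mel-spectrogram -> waveform via the pre-trained vocoder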