Multi-Source Music Generation with Latent Diffusion

Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
{"title":"Multi-Source Music Generation with Latent Diffusion","authors":"Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury","doi":"arxiv-2409.06190","DOIUrl":null,"url":null,"abstract":"Most music generation models directly generate a single music mixture. To\nallow for more flexible and controllable generation, the Multi-Source Diffusion\nModel (MSDM) has been proposed to model music as a mixture of multiple\ninstrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use\none single diffusion model to generate consistent music sources, which are\nfurther mixed to form the music. Despite its capabilities, MSDM is unable to\ngenerate songs with rich melodies and often generates empty sounds. Also, its\nwaveform diffusion introduces significant Gaussian noise artifacts, which\ncompromises audio quality. In response, we introduce a multi-source latent\ndiffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode\neach instrumental source into a distinct latent representation. By training a\nVAE on all music sources, we efficiently capture each source's unique\ncharacteristics in a source latent that our diffusion model models jointly.\nThis approach significantly enhances the total and partial generation of music\nby leveraging the VAE's latent compression and noise-robustness. The compressed\nsource latent also facilitates more efficient generation. Subjective listening\ntests and Frechet Audio Distance (FAD) scores confirm that our model\noutperforms MSDM, showcasing its practical and enhanced applicability in music\ngeneration systems. We also emphasize that modeling sources is more effective\nthan direct music mixture modeling. Codes and models are available at\nhttps://github.com/XZWY/MSLDM. Demos are available at\nhttps://xzwy.github.io/MSLDMDemo.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"29 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06190","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually consistent music sources, which are then mixed to form the music. Despite its capabilities, MSDM is unable to generate songs with rich melodies and often generates empty sounds. Moreover, its waveform-domain diffusion introduces significant Gaussian noise artifacts, which compromise audio quality. In response, we introduce a multi-source latent diffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a source latent, and our diffusion model then models these source latents jointly. This approach significantly enhances both total and partial generation of music by leveraging the VAE's latent compression and noise robustness. The compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than directly modeling the music mixture. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo.
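The pipeline described in the abstract lends itself to a short sketch: a shared VAE compresses each instrumental source into its own latent, a single diffusion model samples all source latents jointly (which is what keeps the instruments consistent with each other), and the decoded waveforms are summed to form the mixture. The following is a minimal, illustrative sketch of that flow, not the authors' implementation; the names (`SourceVAE`, `LatentDiffusionStub`, `generate_mixture`), shapes, and the stub sampler are all assumptions for illustration. See https://github.com/XZWY/MSLDM for the real code.

```python
import torch

K = 4  # instrumental sources: piano, drums, bass, guitar


class SourceVAE(torch.nn.Module):
    """Shared VAE that compresses one waveform source into a latent sequence.

    The conv layers are stand-ins (the variational posterior is omitted for
    brevity); the paper's VAE is a full waveform codec.
    """

    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = torch.nn.Conv1d(1, latent_dim, kernel_size=1024, stride=512)
        self.decoder = torch.nn.ConvTranspose1d(latent_dim, 1, kernel_size=1024, stride=512)

    def encode(self, wav):  # (B, 1, T) -> (B, D, T'); used at training time
        return self.encoder(wav)

    def decode(self, z):    # (B, D, T') -> (B, 1, T)
        return self.decoder(z)


class LatentDiffusionStub:
    """Stand-in for the joint latent diffusion model (assumption, not the paper's)."""

    def sample(self, shape, steps=100):
        # A real model would iteratively denoise from Gaussian noise over
        # `steps` steps; we return raw noise so the sketch runs end to end.
        return torch.randn(shape)


def generate_mixture(vae, diffusion, latent_dim=64, latent_len=256):
    # 1. Sample all K source latents *jointly* with one diffusion model,
    #    so the generated instruments stay mutually consistent.
    z = diffusion.sample(shape=(1, K * latent_dim, latent_len))
    z = z.view(1, K, latent_dim, latent_len)

    # 2. Decode each source latent back to a waveform with the shared VAE.
    sources = [vae.decode(z[:, k]) for k in range(K)]

    # 3. The final music is the simple sum (mix) of the generated sources.
    return torch.stack(sources).sum(dim=0)


vae = SourceVAE()
mix = generate_mixture(vae, LatentDiffusionStub())
print(mix.shape)  # (1, 1, num_samples)
```

Because the sources are generated individually before mixing, partial generation (e.g., sampling only a bass line conditioned on existing stems) falls out of the same model, which is the flexibility the abstract contrasts against direct mixture modeling.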