Multi-Source Music Generation with Latent Diffusion
Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
arXiv - EE - Audio and Speech Processing, arXiv:2409.06190 (2024-09-10)
Abstract
Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually consistent music sources, which are then mixed to form the music. Despite its capabilities, MSDM is unable to generate songs with rich melodies and often produces empty sounds. Moreover, its waveform-domain diffusion introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a multi-source latent diffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a single VAE on all music sources, we efficiently capture each source's unique characteristics in a source latent, and our diffusion model then models these source latents jointly. This approach significantly improves both total and partial music generation by leveraging the VAE's latent compression and noise robustness. The compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, demonstrating its practical and enhanced applicability in music generation systems. We also emphasize that modeling individual sources is more effective than modeling the music mixture directly. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo.
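
To make the described pipeline concrete, the sketch below illustrates the general idea of multi-source latent diffusion: a shared VAE encoder compresses each instrumental stem into its own latent, the per-source latents are stacked, and one diffusion model is trained to denoise all source latents jointly. This is a minimal, hypothetical illustration of the concept from the abstract, not the authors' implementation; all module architectures, shapes, names, and the simplified DDPM-style objective are assumptions.

import torch
import torch.nn as nn

class ToySourceVAEEncoder(nn.Module):
    """Maps a mono waveform (B, 1, T) to a compressed source latent (B, C, T // 64)."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(64, 2 * latent_channels, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        mean, logvar = self.net(wav).chunk(2, dim=1)
        # Reparameterization trick: sample a latent from the predicted Gaussian.
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()

class ToyJointDenoiser(nn.Module):
    """Predicts the noise added to the stacked source latents (B, K*C, L)."""
    def __init__(self, num_sources: int = 4, latent_channels: int = 8):
        super().__init__()
        ch = num_sources * latent_channels
        self.net = nn.Sequential(
            nn.Conv1d(ch + 1, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(128, ch, kernel_size=3, padding=1),
        )

    def forward(self, z_noisy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Broadcast the normalized timestep as one extra conditioning channel.
        t_map = t.view(-1, 1, 1).expand(-1, 1, z_noisy.shape[-1])
        return self.net(torch.cat([z_noisy, t_map], dim=1))

def diffusion_training_step(stems, encoder, denoiser, num_steps=1000):
    """One simplified DDPM-style step on latents of K stems; stems: (B, K, 1, T)."""
    B, K = stems.shape[:2]
    with torch.no_grad():  # the shared VAE is assumed pretrained and frozen
        z = torch.cat([encoder(stems[:, k]) for k in range(K)], dim=1)  # (B, K*C, L)
    t = torch.randint(0, num_steps, (B,), device=z.device)
    # Illustrative cosine-style noise schedule in [0, 1].
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps).view(-1, 1, 1) ** 2
    noise = torch.randn_like(z)
    z_noisy = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(z_noisy, t.float() / num_steps)
    return nn.functional.mse_loss(pred, noise)

if __name__ == "__main__":
    encoder, denoiser = ToySourceVAEEncoder(), ToyJointDenoiser()
    fake_stems = torch.randn(2, 4, 1, 16384)  # batch of 2 songs, 4 stems each
    print(diffusion_training_step(fake_stems, encoder, denoiser).item())

Because the denoiser sees all source latents at once, the generated stems remain mutually consistent and can simply be decoded and summed into a mixture; generating in the compressed latent space rather than on raw waveforms is what avoids the waveform-diffusion noise artifacts the abstract attributes to MSDM.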