Multi-instrument Music Synthesis with Spectrogram Diffusion

Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel
{"title":"Multi-instrument Music Synthesis with Spectrogram Diffusion","authors":"Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel","doi":"10.48550/arXiv.2206.05408","DOIUrl":null,"url":null,"abstract":"An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fr\\'echet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.","PeriodicalId":309903,"journal":{"name":"International Society for Music Information Retrieval Conference","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Society for Music Information Retrieval Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2206.05408","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
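To make the two-stage pipeline in the abstract concrete, the sketch below shows its general shape: an encoder over note tokens, a decoder that denoises a spectrogram conditioned on those notes via standard DDPM ancestral sampling, and a separately trained spectrogram inverter for the final audio. This is a minimal illustration, not the authors' implementation; all class names, layer sizes, the shortened step count, and the placeholder vocoder are assumptions for the sake of a runnable example.

```python
# Minimal sketch of the two-stage pipeline (illustrative, not the paper's code):
# (1) note tokens -> spectrogram via conditional DDPM sampling,
# (2) spectrogram -> audio via a pretrained GAN vocoder (omitted here).
import torch
import torch.nn as nn

# Assumed sizes; T_STEPS is shortened so the sketch runs quickly on CPU.
N_MELS, N_FRAMES, D_MODEL, T_STEPS = 128, 256, 512, 100

class NoteEncoder(nn.Module):
    """Stand-in for the Transformer encoder over MIDI-like note tokens."""
    def __init__(self, vocab=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                      # (B, L) -> (B, L, D_MODEL)
        return self.encoder(self.embed(tokens))

class DenoisingDecoder(nn.Module):
    """Stand-in for the decoder that predicts the noise added to a spectrogram,
    conditioned on the encoded notes and the diffusion timestep."""
    def __init__(self):
        super().__init__()
        self.t_embed = nn.Embedding(T_STEPS, D_MODEL)
        self.in_proj = nn.Linear(N_MELS, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out_proj = nn.Linear(D_MODEL, N_MELS)

    def forward(self, noisy_spec, t, note_memory):  # (B, F, N_MELS) -> same shape
        h = self.in_proj(noisy_spec) + self.t_embed(t)[:, None, :]
        return self.out_proj(self.decoder(h, note_memory))

@torch.no_grad()
def ddpm_sample(encoder, decoder, tokens):
    """Standard DDPM ancestral sampling of a spectrogram, conditioned on notes."""
    betas = torch.linspace(1e-4, 0.02, T_STEPS)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    memory = encoder(tokens)
    x = torch.randn(tokens.shape[0], N_FRAMES, N_MELS)  # start from pure noise
    for t in reversed(range(T_STEPS)):
        t_batch = torch.full((tokens.shape[0],), t, dtype=torch.long)
        eps = decoder(x, t_batch, memory)                # predicted noise
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x                                             # (B, N_FRAMES, N_MELS)

tokens = torch.randint(0, 1024, (1, 64))                 # dummy note tokens
spec = ddpm_sample(NoteEncoder(), DenoisingDecoder(), tokens)
# Stage 2 would pass `spec` through a pretrained spectrogram inverter
# (a GAN vocoder in the paper); a real vocoder is omitted from this sketch.
audio = spec  # placeholder for vocoder(spec) in a full system
```

The key property the abstract highlights is that the conditioning comes from note tokens rather than raw audio, so the same sampler serves arbitrary instrument combinations; swapping the DDPM loop for autoregressive frame prediction is the alternative decoder the authors compare against.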