DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
arXiv - CS - Sound, 2024-05-30. arXiv:2405.20289 (https://doi.org/arxiv-2405.20289)
Abstract
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control-design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by over 10-20x, but also simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
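To make steps (2) and (3) of the abstract concrete, the sketch below shows what an inference-time noise-latent optimization loop with a one-step distilled model as surrogate, followed by multi-step decoding, could look like. This is a minimal illustration only, assuming a pre-distilled model from step (1); `distilled_model`, `multistep_sample`, and `control_loss` are hypothetical placeholder names, not the authors' released code or API.

    # Hedged sketch of DITTO-2-style inference-time optimization (not the official implementation).
    import torch

    def ditto2_generate(distilled_model, multistep_sample, control_loss,
                        latent_shape, num_opt_steps=50, lr=1e-2, decode_steps=4):
        # Step (2): optimize the initial noise latents, using cheap one-step
        # sampling from the distilled model as a surrogate for the full sampler.
        noise = torch.randn(latent_shape, requires_grad=True)
        opt = torch.optim.Adam([noise], lr=lr)
        for _ in range(num_opt_steps):
            x0 = distilled_model(noise)      # one-step consistency-style sample
            loss = control_loss(x0)          # e.g. melody, intensity, or CLAP-based objective
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Step (3): decode the optimized noise latents with a few multi-step
        # sampling steps for higher-quality output at the same control target.
        with torch.no_grad():
            return multistep_sample(distilled_model, noise.detach(), steps=decode_steps)

The same loop structure would apply to the text-adherence application mentioned above, where `control_loss` is replaced by a negative CLAP similarity between the generated audio and a target text prompt.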