DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
arXiv - CS - Sound, 2024-05-30. arXiv:2405.20289 (https://doi.org/arxiv-2405.20289)
Abstract
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control-design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by over 10-20x, but also simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
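To make steps (2) and (3) of the abstract concrete, the sketch below shows what an inference-time noise-latent optimization loop with a one-step distilled model as surrogate, followed by multi-step decoding, could look like. This is a minimal illustration only, assuming a pre-distilled model from step (1); `distilled_model`, `multistep_sample`, and `control_loss` are hypothetical placeholder names, not the authors' released code or API.

    # Hedged sketch of DITTO-2-style inference-time optimization (not the official implementation).
    import torch

    def ditto2_generate(distilled_model, multistep_sample, control_loss,
                        latent_shape, num_opt_steps=50, lr=1e-2, decode_steps=4):
        # Step (2): optimize the initial noise latents, using cheap one-step
        # sampling from the distilled model as a surrogate for the full sampler.
        noise = torch.randn(latent_shape, requires_grad=True)
        opt = torch.optim.Adam([noise], lr=lr)
        for _ in range(num_opt_steps):
            x0 = distilled_model(noise)      # one-step consistency-style sample
            loss = control_loss(x0)          # e.g. melody, intensity, or CLAP-based objective
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Step (3): decode the optimized noise latents with a few multi-step
        # sampling steps for higher-quality output at the same control target.
        with torch.no_grad():
            return multistep_sample(distilled_model, noise.detach(), steps=decode_steps)

The same loop structure would apply to the text-adherence application mentioned above, where `control_loss` is replaced by a negative CLAP similarity between the generated audio and a target text prompt.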