Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

IF 5.1 · Zone 2 (Computer Science) · Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI:10.1109/TASLP.2024.3485485

Jinlong Xue;Yayue Deng;Yingming Gao;Ya Li

Volume 32, pages 4700-4712. Not open access. https://ieeexplore.ieee.org/document/10731578/

Citations: 0

Abstract

Recent advancements in diffusion models and large language models (LLMs) have significantly advanced generative tasks. Text-to-Audio (TTA), a burgeoning generation application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, the text encoder serves as a critical bridge between text and audio, since it acts as an instruction for the diffusion model to generate coherent content. Previous studies in T2I recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, while similar evaluation is lacking in prior TTA work. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments, being the first to reveal the internal mechanisms in the TTA field and intuitively explain how different text encoders influence the diffusion process. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations.




Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech, and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering, and document indexing and retrieval, as well as general language modeling.