LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang
arXiv:2407.10468 · 2024-07-15 · arXiv - CS - Sound
Latent diffusion models have shown promising results in audio generation, making notable advances over traditional methods. However, while their performance is impressive on short audio clips, it degrades when extended to longer audio sequences. These challenges stem from the model's self-attention mechanism and from training predominantly on 10-second clips, which complicates extension to longer audio without adaptation. In response, we introduce LiteFocus, a novel approach that accelerates the inference of existing audio latent diffusion models for long audio synthesis. Based on the attention patterns observed in self-attention, we employ a dual sparse form of attention calculation, designated same-frequency focus and cross-frequency compensation: it curtails the attention computation under a same-frequency constraint while preserving audio quality through cross-frequency compensation. LiteFocus reduces the inference time of a diffusion-based text-to-audio (TTA) model by 1.99x when synthesizing 80-second audio clips, while also improving audio quality.
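
To make the dual sparse attention idea concrete, here is a minimal sketch of how such a mask might be built for a self-attention layer over a flattened time-frequency latent grid. Everything here is an assumption for illustration: the function names (dual_sparse_mask, sparse_self_attention), the time-major token layout, the time_window parameter, and the specific sparsity rule are not taken from the paper, which does not publish its implementation in this abstract.

```python
import torch
import torch.nn.functional as F


def dual_sparse_mask(num_time: int, num_freq: int, time_window: int = 2) -> torch.Tensor:
    """Boolean attention mask over a flattened (time, freq) token grid.

    Hypothetical reading of LiteFocus's dual sparse attention:
      - same-frequency focus: a token attends to all tokens in its own
        frequency bin, across the full time axis;
      - cross-frequency compensation: it additionally attends to tokens
        in other frequency bins, but only within a small local time window.
    The paper's exact sparsity rule may differ; this is an illustration.
    """
    n = num_time * num_freq
    t = torch.arange(n) // num_freq  # time index of each token (time-major layout)
    f = torch.arange(n) % num_freq   # frequency index of each token

    same_freq = f[:, None] == f[None, :]                        # same-frequency focus
    near_time = (t[:, None] - t[None, :]).abs() <= time_window  # cross-frequency compensation
    return same_freq | near_time                                # True = may attend


def sparse_self_attention(q, k, v, mask):
    # q, k, v: (batch, heads, n_tokens, head_dim); mask: (n_tokens, n_tokens)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Example: a hypothetical 80-second latent with 400 time steps and 16 frequency bins.
mask = dual_sparse_mask(num_time=400, num_freq=16)
print(f"attended fraction: {mask.float().mean():.3f}")  # well below dense attention's 1.0
```

Note that masking a dense score matrix, as above, only demonstrates the sparsity pattern; realizing the reported 1.99x speedup would require a block-sparse or gather-based attention kernel that skips the masked-out pairs entirely rather than computing and discarding them.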