StyleCrafter: Taming Artistic Video Diffusion with Reference-Augmented Adapter Learning

IF 7.8 1区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Graphics Pub Date : 2024-11-19 DOI:10.1145/3687975

Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Yibo Wang, Xintao Wang, Ying Shan, Yujiu Yang

{"title":"StyleCrafter: Taming Artistic Video Diffusion with Reference-Augmented Adapter Learning","authors":"Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Yibo Wang, Xintao Wang, Ying Shan, Yujiu Yang","doi":"10.1145/3687975","DOIUrl":null,"url":null,"abstract":"Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired artistic videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pretrained T2V models with a style control adapter, allowing video generation in any style by feeding a reference image. Considering the scarcity of artistic video data, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we employ carefully designed data augmentation strategies to enhance decoupled learning. Additionally, we propose a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors. Project page: https://gongyeliu.github.io/StyleCrafter.github.io/","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"176 1","pages":""},"PeriodicalIF":7.8000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Graphics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3687975","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired artistic videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pretrained T2V models with a style control adapter, allowing video generation in any style by feeding a reference image. Considering the scarcity of artistic video data, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we employ carefully designed data augmentation strategies to enhance decoupled learning. Additionally, we propose a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors. Project page: https://gongyeliu.github.io/StyleCrafter.github.io/

查看原文本刊更多论文

StyleCrafter：通过参考增强适配器学习控制艺术视频扩散

文本到视频（T2V）模型在生成多样化视频方面表现出了卓越的能力。然而，由于(i) 文本在表达特定风格时固有的笨拙性和(ii) 普遍降低的风格保真度，它们很难生成用户所需的艺术视频。为了应对这些挑战，我们引入了 StyleCrafter，这是一种通用方法，它利用风格控制适配器来增强预训练的 T2V 模型，通过提供参考图像来生成任何风格的视频。考虑到艺术视频数据的稀缺性，我们建议首先使用风格丰富的图像数据集训练风格控制适配器，然后通过量身定制的微调范式将所学的风格化能力转移到视频生成中。为了促进内容与风格的分离，我们采用了精心设计的数据增强策略来加强解耦学习。此外，我们还提出了一个规模自适应融合模块，以平衡基于文本的内容特征和基于图像的风格特征的影响，这有助于在各种文本和风格组合中实现泛化。StyleCrafter 能高效生成高质量的风格化视频，这些视频既与文本内容一致，又与参考图像的风格相似。实验证明，我们的方法比现有的竞争对手更灵活、更高效。项目页面： https://gongyeliu.github.io/StyleCrafter.github.io/

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Graphics 工程技术-计算机：软件工程

CiteScore

14.30

自引率

25.80%

发文量

193

审稿时长

12 months

期刊介绍： ACM Transactions on Graphics (TOG) is a peer-reviewed scientific journal that aims to disseminate the latest findings of note in the field of computer graphics. It has been published since 1982 by the Association for Computing Machinery. Starting in 2003, all papers accepted for presentation at the annual SIGGRAPH conference are printed in a special summer issue of the journal.