{"title":"MotionCrafter: Plug-and-Play Motion Guidance for Diffusion Models.","authors":"Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Pengfei Wan, Tong-Yee Lee, Changsheng Xu","doi":"10.1109/TVCG.2025.3568880","DOIUrl":null,"url":null,"abstract":"<p><p>The essence of a video lies in the dynamic motions. While text-to-video generative diffusion models have made significant strides in creating diverse content, effectively controlling specific motions through text prompts remains a challenge. By utilizing user-specified reference videos, the more precise guidance for character actions, object movements, and camera movements can be achieved. This gives rise to the task of motion customization, where the primary challenge lies in effectively decoupling the appearance and motion within a video clip. To address this challenge, we introduce MotionCrafter, a novel one-shot instance-guided motion customization method that is suitable for both pre-trained text-to-video and text-to-image diffusion models. MotionCrafter employs a parallel spatial-temporal architecture that integrates the reference motion into the temporal component of the base model, while independently adjusting the spatial module for character or style control. To enhance the disentanglement of motion and appearance, we propose an innovative dual-branch motion disentanglement approach, which includes a motion disentanglement loss and an appearance prior enhancement strategy. To facilitate more efficient learning of motions, we further propose a novel timestep-layered tuning strategy that directs the diffusion model to focus on motion-level information. Through comprehensive quantitative and qualitative experiments, along with user preference tests, we demonstrate that MotionCrafter can successfully integrate dynamic motions while maintaining the coherence and quality of the base model, providing a wide range of appearance generation capabilities. MotionCrafter can be applied to various personalized backbones in the community to generate videos with a variety of artistic styles.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3568880","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The essence of a video lies in its dynamic motions. While text-to-video generative diffusion models have made significant strides in creating diverse content, effectively controlling specific motions through text prompts remains a challenge. By utilizing user-specified reference videos, more precise guidance for character actions, object movements, and camera motion can be achieved. This gives rise to the task of motion customization, where the primary challenge lies in effectively decoupling appearance and motion within a video clip. To address this challenge, we introduce MotionCrafter, a novel one-shot instance-guided motion customization method suitable for both pre-trained text-to-video and text-to-image diffusion models. MotionCrafter employs a parallel spatial-temporal architecture that integrates the reference motion into the temporal component of the base model, while independently adjusting the spatial module for character or style control. To enhance the disentanglement of motion and appearance, we propose a dual-branch motion disentanglement approach, which comprises a motion disentanglement loss and an appearance-prior enhancement strategy. To facilitate more efficient learning of motions, we further propose a timestep-layered tuning strategy that directs the diffusion model to focus on motion-level information. Through comprehensive quantitative and qualitative experiments, along with user preference tests, we demonstrate that MotionCrafter successfully integrates dynamic motions while maintaining the coherence and quality of the base model, providing a wide range of appearance generation capabilities. MotionCrafter can be applied to various personalized backbones in the community to generate videos in a variety of artistic styles.
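To make the timestep-layered tuning idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation. It assumes a toy denoiser with separate spatial (appearance) and temporal (motion) modules, freezes the spatial pathway, updates only the temporal pathway, and weights the diffusion loss toward noisier timesteps under the assumption that motion-level structure is decided there. All module names, the toy denoiser, and the weighting function are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): tune only the temporal pathway of a
# video diffusion model and emphasise high-noise timesteps in the loss.
import torch
import torch.nn as nn

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a text-to-video U-Net with separate spatial/temporal parts."""
    def __init__(self, dim=64):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)    # frozen: appearance pathway
        self.temporal = nn.Linear(dim, dim)   # tuned: motion pathway

    def forward(self, x):
        return self.temporal(self.spatial(x))

def motion_timestep_weight(t, t_max=1000, low=0.1):
    # Assumption for illustration: motion structure is shaped mostly at large
    # (noisy) timesteps, so weight the loss toward them.
    return low + (1.0 - low) * (t.float() / t_max)

model = ToyVideoDenoiser()
for p in model.spatial.parameters():          # keep the appearance pathway fixed
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(model.temporal.parameters(), lr=1e-4)

# One illustrative training step on dummy data standing in for reference-clip latents.
x = torch.randn(4, 64)
noise = torch.randn_like(x)
t = torch.randint(0, 1000, (4,))
pred = model(x + noise)                       # toy "denoising" prediction
per_sample = ((pred - noise) ** 2).mean(dim=-1)
loss = (motion_timestep_weight(t) * per_sample).mean()

optimizer.zero_grad()
loss.backward()                               # gradients reach only the temporal module
optimizer.step()
```

The sketch only illustrates the division of labor the abstract describes: appearance is preserved by keeping the spatial module untouched, while motion customization is confined to the temporal module and to the timesteps assumed to carry motion-level information.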