AMG: Avatar Motion Guided Video Generation

Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang
{"title":"AMG:阿凡达动作引导视频生成器","authors":"Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang","doi":"arxiv-2409.01502","DOIUrl":null,"url":null,"abstract":"Human video generation task has gained significant attention with the\nadvancement of deep generative models. Generating realistic videos with human\nmovements is challenging in nature, due to the intricacies of human body\ntopology and sensitivity to visual artifacts. The extensively studied 2D media\ngeneration methods take advantage of massive human media datasets, but struggle\nwith 3D-aware control; whereas 3D avatar-based approaches, while offering more\nfreedom in control, lack photorealism and cannot be harmonized seamlessly with\nbackground scene. We propose AMG, a method that combines the 2D photorealism\nand 3D controllability by conditioning video diffusion models on controlled\nrendering of 3D avatars. We additionally introduce a novel data processing\npipeline that reconstructs and renders human avatar movements from dynamic\ncamera videos. AMG is the first method that enables multi-person diffusion\nvideo generation with precise control over camera positions, human motions, and\nbackground style. We also demonstrate through extensive evaluation that it\noutperforms existing human video generation methods conditioned on pose\nsequences or driving videos in terms of realism and adaptability.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"136 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AMG: Avatar Motion Guided Video Generation\",\"authors\":\"Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang\",\"doi\":\"arxiv-2409.01502\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human video generation task has gained significant attention with the\\nadvancement of deep generative models. Generating realistic videos with human\\nmovements is challenging in nature, due to the intricacies of human body\\ntopology and sensitivity to visual artifacts. The extensively studied 2D media\\ngeneration methods take advantage of massive human media datasets, but struggle\\nwith 3D-aware control; whereas 3D avatar-based approaches, while offering more\\nfreedom in control, lack photorealism and cannot be harmonized seamlessly with\\nbackground scene. We propose AMG, a method that combines the 2D photorealism\\nand 3D controllability by conditioning video diffusion models on controlled\\nrendering of 3D avatars. We additionally introduce a novel data processing\\npipeline that reconstructs and renders human avatar movements from dynamic\\ncamera videos. AMG is the first method that enables multi-person diffusion\\nvideo generation with precise control over camera positions, human motions, and\\nbackground style. 
We also demonstrate through extensive evaluation that it\\noutperforms existing human video generation methods conditioned on pose\\nsequences or driving videos in terms of realism and adaptability.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":\"136 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.01502\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01502","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The task of human video generation has gained significant attention with the advancement of deep generative models. Generating realistic videos of human movement is inherently challenging due to the intricacy of human body topology and viewers' sensitivity to visual artifacts. Extensively studied 2D media generation methods take advantage of massive human media datasets but struggle with 3D-aware control, whereas 3D avatar-based approaches, while offering more freedom of control, lack photorealism and cannot be harmonized seamlessly with the background scene. We propose AMG, a method that combines 2D photorealism with 3D controllability by conditioning video diffusion models on controlled renderings of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic-camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.
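To make the conditioning idea concrete, here is a minimal sketch, not the authors' implementation: a toy video denoiser that receives rendered avatar frames as extra input channels concatenated to the noisy video latent, so camera position and human motion are controlled through the render. All module names, shapes, and the noising schedule below are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the AMG code): conditioning a video
# diffusion denoiser on rendered 3D-avatar frames via channel concatenation.
import torch
import torch.nn as nn

class AvatarConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts noise for a video latent given rendered
    avatar frames (pose/camera control) stacked as extra channels."""
    def __init__(self, latent_ch=4, cond_ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            # 3D convolutions mix information across time (frames) and space.
            nn.Conv3d(latent_ch + cond_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_ch, 3, padding=1),
        )

    def forward(self, noisy_latent, avatar_render):
        # noisy_latent:  (B, latent_ch, T, H, W) noised video latent
        # avatar_render: (B, cond_ch,  T, H, W) avatar renders, same frames
        x = torch.cat([noisy_latent, avatar_render], dim=1)
        return self.net(x)  # predicted noise, same shape as noisy_latent

# One training step of the standard denoising objective,
# || eps - eps_theta(x_t, render) ||^2, with a simple linear noising schedule:
model = AvatarConditionedDenoiser()
latent = torch.randn(2, 4, 8, 32, 32)   # clean video latent (B, C, T, H, W)
render = torch.randn(2, 3, 8, 32, 32)   # avatar renders for the same frames
t = torch.rand(2, 1, 1, 1, 1)           # noise level in [0, 1]
eps = torch.randn_like(latent)
noisy = (1 - t) * latent + t * eps      # interpolate toward pure noise
loss = ((model(noisy, render) - eps) ** 2).mean()
loss.backward()
```

At sampling time, the same render stream would be fed at every denoising step, which is what lets one motion or camera trajectory steer the generated video while the diffusion prior supplies photorealistic appearance and background.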