StruGauAvatar: Learning Structured 3D Gaussians for Animatable Avatars from Monocular Videos
Yihao Zhi, Wanhu Sun, Jiahao Chang, Chongjie Ye, Wensen Feng, Xiaoguang Han
IEEE Transactions on Visualization and Computer Graphics, published 2025-04-03
DOI: 10.1109/TVCG.2025.3557457 (https://doi.org/10.1109/TVCG.2025.3557457)
Abstract
Neural 3D avatar reconstruction has advanced significantly in recent years. Among related tasks, building an animatable avatar from monocular video is one of the most challenging, yet it has a wide range of applications. Here, "animatable" means that arbitrary, unseen poses can be transferred onto the avatar to generate new 3D videos. NeRF, a powerful representation, has made generating high-fidelity animatable avatars from videos easier and more accessible. Despite the impressive visual results of NeRF-based methods, their substantial training and rendering overhead severely limits their applications. 3D Gaussian Splatting (3D-GS), a more recent representation, offers both high-quality and high-efficiency rendering, which has led many concurrent works to introduce 3D-GS into animatable avatar building. Although these works produce very high-fidelity renderings for poses similar to those in the training video frames, they produce poor results when the poses are far from the training poses. We argue that this is primarily because the Gaussian points lack structure. We therefore propose using DMTet to represent the coarse geometry of the avatar. In our representation, the majority of Gaussian points are bound to the mesh vertices, while some free Gaussians are allowed to expand to better fit the given video. Furthermore, we develop a dual-space optimization framework to jointly optimize the DMTet, Gaussian points, and skinning weights under two spaces. As a result, the Gaussian points are deformed in a constrained way, which dramatically improves generalization to unseen poses. Extensive experiments demonstrate this improvement.
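The idea of binding Gaussian points to mesh vertices and deforming them through skinning weights can be illustrated with a small sketch. The abstract does not specify the deformation model, so the code below assumes standard linear blend skinning and a one-to-one binding of each Gaussian center to a single vertex; the function names (`skin_point`, `pose_bound_gaussians`) and the toy two-bone setup are hypothetical and are not taken from the paper.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z axis, as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(R, t, p):
    """Return R @ p + t for a 3x3 matrix R and 3-vectors t, p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def skin_point(p, weights, bones):
    """Linear blend skinning (assumed model): weighted sum of the
    per-bone rigid transforms of the canonical-space point p."""
    out = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, bones):
        q = apply_rigid(R, t, p)
        out = [o + w * qi for o, qi in zip(out, q)]
    return out

def pose_bound_gaussians(vertices, vertex_weights, binding, bones):
    """Pose Gaussian centers that are bound to mesh vertices.

    binding[i] is the index of the vertex that Gaussian i follows.
    This one-to-one binding is an illustrative assumption; the
    paper's actual binding scheme may differ.
    """
    posed = [skin_point(v, w, bones) for v, w in zip(vertices, vertex_weights)]
    return [posed[j] for j in binding]

# Toy setup: bone 0 is the identity, bone 1 rotates 90 degrees about z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bones = [(identity, [0.0, 0.0, 0.0]), (rot_z(math.pi / 2), [0.0, 0.0, 0.0])]

vertices = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]   # canonical-space mesh vertices
vertex_weights = [[1.0, 0.0], [0.0, 1.0]]       # per-vertex skinning weights
binding = [0, 1, 1]                             # three Gaussians, two vertices

centers = pose_bound_gaussians(vertices, vertex_weights, binding, bones)
blended = skin_point([1.0, 0.0, 0.0], [0.5, 0.5], bones)
```

Because every bound Gaussian inherits the motion of its vertex, its position in an unseen pose is determined by the mesh and the skinning weights rather than by per-Gaussian parameters fitted to the training frames. This is one way to read the "constrained" deformation described above; free Gaussians would need their own positions and weights, which this sketch leaves out.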