Text-Driven High-Quality 3D Human Generation via Variational Gradient Estimation and Latent Reward Models
Authors: Pengfei Zhou, Xukun Shen, Yong Hu
DOI: 10.1002/cav.70089
Journal: Computer Animation and Virtual Worlds, vol. 37, no. 1 (Q4, Computer Science, Software Engineering; Impact Factor 1.7)
Publication date: 2026-01-08
URL: https://onlinelibrary.wiley.com/doi/10.1002/cav.70089
Citation count: 0
Abstract
Recent advances in Score Distillation Sampling (SDS) have enabled text-driven 3D human generation, yet the standard classifier-free guidance (CFG) framework struggles with semantic misalignment and texture oversaturation due to limited model capacity. We propose a novel framework that decouples conditional and unconditional guidance via a dual-model strategy: a pretrained diffusion model ensures geometric stability, while a preference-tuned latent reward model enhances semantic fidelity. To further refine noise estimation, we introduce a lightweight U-shaped Swin Transformer (U-Swin) that regularizes the predicted noise against the reward model, reducing gradient bias and local artifacts. Additionally, we design a time-varying noise weighting mechanism that dynamically balances the two guidance signals during denoising, improving stability and texture realism. Extensive experiments show that our method significantly improves alignment with textual descriptions, enhances texture details, and outperforms state-of-the-art baselines in both visual quality and semantic consistency.
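The dual-model idea of blending two noise estimates with a time-varying weight can be illustrated with a minimal sketch. This is not the paper's actual mechanism: the linear schedule, the function name `blend_guidance`, and the scalar stand-ins for the two models' noise predictions are all assumptions for illustration only.

```python
def blend_guidance(eps_pretrained, eps_reward, t, t_max=1000.0, w_max=1.0):
    """Blend two noise estimates with a time-varying weight.

    Illustrative sketch only: `eps_pretrained` stands in for the
    pretrained diffusion model's noise prediction (geometric stability)
    and `eps_reward` for the reward-tuned model's prediction (semantic
    fidelity). The linear schedule below, which shifts weight toward
    the reward-guided signal at late (low-noise) timesteps, is an
    assumed placeholder, not the schedule proposed in the paper.
    """
    w = w_max * (1.0 - t / t_max)  # weight grows as t -> 0
    return (1.0 - w) * eps_pretrained + w * eps_reward
```

In practice the two inputs would be full noise tensors predicted at timestep `t`, and the blended estimate would feed the SDS gradient; the same convex-combination structure applies elementwise.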
Journal Overview
With the advent of powerful PCs and high-end graphics cards, there has been remarkable development in Virtual Worlds, real-time computer animation, simulation, and games. At the same time, new and cheaper Virtual Reality devices have appeared, allowing interaction with these real-time Virtual Worlds and even with real worlds through Augmented Reality. Three-dimensional characters, especially Virtual Humans, are now of exceptional quality, which allows them to be used in the movie industry. But this is only a beginning: with the development of Artificial Intelligence and agent technology, these characters will become increasingly autonomous and even intelligent. They will inhabit Virtual Worlds in a Virtual Life together with animals and plants.