THGS: Lifelike Talking Human Avatar Synthesis From Monocular Video Via 3D Gaussian Splatting

IF 2.7 | Zone 4, Computer Science | JCR Q2, COMPUTER SCIENCE, SOFTWARE ENGINEERING

Computer Graphics Forum Pub Date : 2025-01-25 DOI:10.1111/cgf.15282

Chuang Chen, Lingyun Yu, Quanwei Yang, Aihua Zheng, Hongtao Xie

Computer Graphics Forum, Volume 44, Issue 1. Publication type: Journal Article. Publisher page: https://onlinelibrary.wiley.com/doi/10.1111/cgf.15282

Citations: 0

Abstract

Despite remarkable progress in 3D talking head generation, directly generating 3D talking human avatars still suffers from rigid facial expressions, distorted hand textures and out-of-sync lip movements. In this paper, we extend the speaker-specific talking head generation task to talking human avatar synthesis and propose a novel pipeline, THGS, that animates lifelike Talking Human avatars using 3D Gaussian Splatting (3DGS). Given speech audio, expression and body poses as input, THGS effectively overcomes the limitations of 3DGS human reconstruction methods in capturing expressive dynamics, such as mouth movements, facial expressions and hand gestures, from a short monocular video. Firstly, we introduce simple yet effective Learnable Expression Blendshapes (LEB) for facial dynamics reconstruction, where subtle facial dynamics are generated by linearly combining the static head model and expression blendshapes. Secondly, we propose a Spatial Audio Attention Module (SAAM) for lip-synced mouth movement animation, which builds connections between speech audio and mouth Gaussian movements. Thirdly, we employ a joint optimization strategy for body pose, expression and skinning weights that optimizes these parameters on the fly, aligning hand movements and expressions more closely with the video input. Experimental results demonstrate that THGS can achieve high-fidelity 3D talking human avatar animation at 150+ fps on a web-based rendering system, meeting the requirements of real-time applications. Our project page is at https://sora158.github.io/THGS.github.io/.
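The abstract describes facial dynamics as a linear combination of a static head model and expression blendshapes. The following is a minimal sketch of that general idea only, not the authors' LEB implementation: the function name `apply_blendshapes`, the point layout, and the toy values are all assumptions made for illustration, and in a learnable setting the offsets would be optimized parameters rather than hand-written lists.

```python
from typing import List, Sequence

Vec3 = List[float]


def apply_blendshapes(
    base_positions: Sequence[Sequence[float]],
    blendshapes: Sequence[Sequence[Sequence[float]]],
    coefficients: Sequence[float],
) -> List[Vec3]:
    """Return base + sum_k coefficients[k] * blendshapes[k], per point.

    base_positions: N points (x, y, z) of the static head model.
    blendshapes:    K offset sets, each holding N (dx, dy, dz) offsets.
                    (Hypothetical layout; learned in a real pipeline.)
    coefficients:   K expression weights for a single frame.
    """
    if len(blendshapes) != len(coefficients):
        raise ValueError("need one coefficient per blendshape")
    # Copy the static model so the input is left unchanged.
    result = [[float(c) for c in point] for point in base_positions]
    for weight, offsets in zip(coefficients, blendshapes):
        if len(offsets) != len(result):
            raise ValueError("each blendshape must cover every point")
        for point, offset in zip(result, offsets):
            for axis in range(3):
                point[axis] += weight * offset[axis]
    return result


# Toy example: two points and two hypothetical blendshapes.
base = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
shapes = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # moves point 0 along +y
    [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]],  # moves point 1 along +z
]
posed = apply_blendshapes(base, shapes, [0.5, 0.25])
print(posed)  # [[0.0, 0.5, 0.0], [1.0, 0.0, 0.5]]
```

With all coefficients set to zero the function returns the static model unchanged, which is the property that makes the expression term a pure additive correction.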

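Source journal

Computer Graphics Forum (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 5.80
Self-citation rate: 12.00%
Articles published: 175
Review time: 3-6 weeks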

About the journal: Computer Graphics Forum is the official journal of Eurographics, published in cooperation with Wiley-Blackwell, and is a unique, international source of information for computer graphics professionals interested in graphics developments worldwide. It is now one of the leading journals for researchers, developers and users of computer graphics in both commercial and academic environments. The journal reports on the latest developments in the field throughout the world and covers all aspects of the theory, practice and application of computer graphics.