{"title":"ProbTalk3D-X: Prosody enhanced non-deterministic emotion controllable speech-driven 3D facial animation synthesis","authors":"Kazi Injamamul Haque, Sichun Wu, Zerrin Yumak","doi":"10.1016/j.cag.2025.104358","DOIUrl":null,"url":null,"abstract":"<div><div>Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we present ProbTalk3D-X by extending a prior work ProbTalk3D- a two staged VQ-VAE based non-deterministic model, by additionally incorporating prosody features for improved facial accuracy using an emotionally rich facial animation dataset, 3DMEAD. Further, we present a comprehensive comparison of non-deterministic emotion controllable models (including new extended experimental models) leveraging VQ-VAE, VAE and diffusion techniques. We provide an extensive comparative analysis of the experimental models against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. Our evaluation demonstrates that ProbTalk3D-X and original ProbTalk3D achieve superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for visual quality judgment. The entire codebase including the extended models is publicly available.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"132 ","pages":"Article 104358"},"PeriodicalIF":2.8000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Graphics-Uk","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0097849325001992","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of existing models are deterministic, meaning that given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we present ProbTalk3D-X, which extends a prior work, ProbTalk3D, a two-stage VQ-VAE-based non-deterministic model, by additionally incorporating prosody features for improved facial accuracy, using an emotionally rich facial animation dataset, 3DMEAD. Further, we present a comprehensive comparison of non-deterministic emotion-controllable models (including new extended experimental models) leveraging VQ-VAE, VAE, and diffusion techniques. We provide an extensive comparative analysis of the experimental models against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground-truth data for subjective evaluation. Our evaluation demonstrates that ProbTalk3D-X and the original ProbTalk3D achieve superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for visual quality judgment. The entire codebase, including the extended models, is publicly available.
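The abstract describes a two-stage pipeline: a VQ-VAE that learns a discrete codebook of facial motion, followed by an audio-conditioned prior whose sampled codebook indices make the output non-deterministic, with prosody and emotion as additional conditioning signals. The PyTorch sketch below illustrates this general idea only; the module names, dimensions, prosody features (frame-level pitch and energy), and emotion label set are assumptions made for illustration, not the authors' ProbTalk3D-X implementation.

```python
# Hedged sketch of a two-stage VQ-VAE pipeline for speech-driven facial animation.
# All module names, dimensions, and conditioning features are illustrative
# assumptions; this is NOT the authors' ProbTalk3D-X implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=256, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                               # z: (B, T, code_dim)
        w = self.codebook.weight                        # (num_codes, code_dim)
        dist = (z.pow(2).sum(-1, keepdim=True)          # squared L2 distances
                - 2 * z @ w.t()
                + w.pow(2).sum(-1))                     # (B, T, num_codes)
        idx = dist.argmin(dim=-1)                       # discrete motion tokens
        z_q = self.codebook(idx)                        # (B, T, code_dim)
        # Standard VQ-VAE codebook + commitment losses.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx, loss

class MotionVQVAE(nn.Module):
    """Stage 1: learn a discrete codebook of facial-motion frames.
    Per-frame MLPs keep the sketch short; real models use temporal encoders."""
    def __init__(self, motion_dim=15069, code_dim=128):   # 5023 vertices x 3 (assumed)
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim, 512), nn.GELU(),
                                     nn.Linear(512, code_dim))
        self.quantizer = VectorQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.GELU(),
                                     nn.Linear(512, motion_dim))

    def forward(self, motion):                          # motion: (B, T, motion_dim)
        z_q, idx, vq_loss = self.quantizer(self.encoder(motion))
        return self.decoder(z_q), idx, vq_loss

class AudioPrior(nn.Module):
    """Stage 2: predict motion-token logits from audio, prosody, and emotion.
    Sampling from these logits is what makes the synthesis non-deterministic."""
    def __init__(self, audio_dim=768, prosody_dim=2, num_emotions=8,
                 num_codes=256, hidden=256):
        super().__init__()
        self.emotion_emb = nn.Embedding(num_emotions, hidden)
        self.proj = nn.Linear(audio_dim + prosody_dim, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, audio_feat, prosody, emotion_id):
        # audio_feat: (B, T, audio_dim), e.g. self-supervised speech features (assumed)
        # prosody:    (B, T, prosody_dim), e.g. per-frame pitch and energy (assumed)
        x = self.proj(torch.cat([audio_feat, prosody], dim=-1))
        x = x + self.emotion_emb(emotion_id).unsqueeze(1)   # broadcast emotion over time
        x, _ = self.temporal(x)
        return self.head(x)                             # (B, T, num_codes) token logits

# Inference sketch: sample token indices, then decode them to motion.
if __name__ == "__main__":
    vqvae, prior = MotionVQVAE(), AudioPrior()
    audio = torch.randn(1, 100, 768)
    prosody = torch.randn(1, 100, 2)
    emotion = torch.tensor([3])                         # hypothetical emotion label id
    logits = prior(audio, prosody, emotion)
    idx = torch.distributions.Categorical(logits=logits).sample()   # stochastic choice
    motion = vqvae.decoder(vqvae.quantizer.codebook(idx))
    print(motion.shape)                                 # (1, 100, 15069)
```

Sampling from the categorical distribution over codebook indices, rather than taking the argmax, is what yields different motions for the same audio input, which is the non-determinism the paper argues for.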
About the journal:
Computers & Graphics is dedicated to disseminate information on research and applications of computer graphics (CG) techniques. The journal encourages articles on:
1. Research and applications of interactive computer graphics. We are particularly interested in novel interaction techniques and applications of CG to problem domains.
2. State-of-the-art papers on late-breaking, cutting-edge research on CG.
3. Information on innovative uses of graphics principles and technologies.
4. Tutorial papers on both teaching CG principles and innovative uses of CG in education.