ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE

Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak
{"title":"ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE","authors":"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak","doi":"arxiv-2409.07966","DOIUrl":null,"url":null,"abstract":"Audio-driven 3D facial animation synthesis has been an active field of\nresearch with attention from both academia and industry. While there are\npromising results in this area, recent approaches largely focus on lip-sync and\nidentity control, neglecting the role of emotions and emotion control in the\ngenerative process. That is mainly due to the lack of emotionally rich facial\nanimation data and algorithms that can synthesize speech animations with\nemotional expressions at the same time. In addition, majority of the models are\ndeterministic, meaning given the same audio input, they produce the same output\nmotion. We argue that emotions and non-determinism are crucial to generate\ndiverse and emotionally-rich facial animations. In this paper, we propose\nProbTalk3D a non-deterministic neural network approach for emotion controllable\nspeech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\nan emotionally rich facial animation dataset 3DMEAD. We provide an extensive\ncomparative analysis of our model against the recent 3D facial animation\nsynthesis approaches, by evaluating the results objectively, qualitatively, and\nwith a perceptual user study. We highlight several objective metrics that are\nmore suitable for evaluating stochastic outputs and use both in-the-wild and\nground truth data for subjective evaluation. To our knowledge, that is the\nfirst non-deterministic 3D facial animation synthesis method incorporating a\nrich emotion dataset and emotion control with emotion labels and intensity\nlevels. Our evaluation demonstrates that the proposed model achieves superior\nperformance compared to state-of-the-art emotion-controlled, deterministic and\nnon-deterministic models. We recommend watching the supplementary video for\nquality judgement. The entire codebase is publicly available\n(https://github.com/uuembodiedsocialai/ProbTalk3D/).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07966","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic, meaning that given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion-controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and an emotionally rich facial animation dataset, 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method incorporating a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).
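To make the core mechanism of a two-stage VQ-VAE concrete, the sketch below shows the vector-quantization step that underlies such models: continuous encoder features are snapped to the nearest entry of a learned codebook, with a straight-through gradient estimator. This is a minimal, generic illustration, not the authors' implementation; the class name, tensor shapes, and PyTorch framing are assumptions made for the example.

```python
# Minimal vector-quantization layer as used in VQ-VAE-style models (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, code_dim: int = 128, beta: float = 0.25):
        super().__init__()
        # Learned discrete codebook of motion tokens (sizes are arbitrary here).
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, code_dim) continuous features from the motion encoder.
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Squared distances to every codebook entry, then nearest-neighbour lookup.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)
        z_q = self.codebook(indices).view_as(z_e)  # quantized features
        # Codebook and commitment terms of the standard VQ-VAE objective.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: gradients bypass the non-differentiable lookup.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

In a two-stage setup of this kind, a first stage typically learns the codebook over motion sequences, and a second stage predicts code indices from speech (and, here, emotion labels and intensity levels); sampling over those predictions is what makes the output non-deterministic.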