{"title":"ProbTalk3D:使用 VQ-VAE 进行非确定性情感可控语音驱动三维面部动画合成","authors":"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak","doi":"arxiv-2409.07966","DOIUrl":null,"url":null,"abstract":"Audio-driven 3D facial animation synthesis has been an active field of\nresearch with attention from both academia and industry. While there are\npromising results in this area, recent approaches largely focus on lip-sync and\nidentity control, neglecting the role of emotions and emotion control in the\ngenerative process. That is mainly due to the lack of emotionally rich facial\nanimation data and algorithms that can synthesize speech animations with\nemotional expressions at the same time. In addition, majority of the models are\ndeterministic, meaning given the same audio input, they produce the same output\nmotion. We argue that emotions and non-determinism are crucial to generate\ndiverse and emotionally-rich facial animations. In this paper, we propose\nProbTalk3D a non-deterministic neural network approach for emotion controllable\nspeech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\nan emotionally rich facial animation dataset 3DMEAD. We provide an extensive\ncomparative analysis of our model against the recent 3D facial animation\nsynthesis approaches, by evaluating the results objectively, qualitatively, and\nwith a perceptual user study. We highlight several objective metrics that are\nmore suitable for evaluating stochastic outputs and use both in-the-wild and\nground truth data for subjective evaluation. To our knowledge, that is the\nfirst non-deterministic 3D facial animation synthesis method incorporating a\nrich emotion dataset and emotion control with emotion labels and intensity\nlevels. Our evaluation demonstrates that the proposed model achieves superior\nperformance compared to state-of-the-art emotion-controlled, deterministic and\nnon-deterministic models. We recommend watching the supplementary video for\nquality judgement. The entire codebase is publicly available\n(https://github.com/uuembodiedsocialai/ProbTalk3D/).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE\",\"authors\":\"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak\",\"doi\":\"arxiv-2409.07966\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio-driven 3D facial animation synthesis has been an active field of\\nresearch with attention from both academia and industry. While there are\\npromising results in this area, recent approaches largely focus on lip-sync and\\nidentity control, neglecting the role of emotions and emotion control in the\\ngenerative process. That is mainly due to the lack of emotionally rich facial\\nanimation data and algorithms that can synthesize speech animations with\\nemotional expressions at the same time. In addition, majority of the models are\\ndeterministic, meaning given the same audio input, they produce the same output\\nmotion. We argue that emotions and non-determinism are crucial to generate\\ndiverse and emotionally-rich facial animations. 
In this paper, we propose\\nProbTalk3D a non-deterministic neural network approach for emotion controllable\\nspeech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\\nan emotionally rich facial animation dataset 3DMEAD. We provide an extensive\\ncomparative analysis of our model against the recent 3D facial animation\\nsynthesis approaches, by evaluating the results objectively, qualitatively, and\\nwith a perceptual user study. We highlight several objective metrics that are\\nmore suitable for evaluating stochastic outputs and use both in-the-wild and\\nground truth data for subjective evaluation. To our knowledge, that is the\\nfirst non-deterministic 3D facial animation synthesis method incorporating a\\nrich emotion dataset and emotion control with emotion labels and intensity\\nlevels. Our evaluation demonstrates that the proposed model achieves superior\\nperformance compared to state-of-the-art emotion-controlled, deterministic and\\nnon-deterministic models. We recommend watching the supplementary video for\\nquality judgement. The entire codebase is publicly available\\n(https://github.com/uuembodiedsocialai/ProbTalk3D/).\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07966\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07966","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
Audio-driven 3D facial animation synthesis has been an active field of
research with attention from both academia and industry. While there are
promising results in this area, recent approaches largely focus on lip-sync and
identity control, neglecting the role of emotions and emotion control in the
generative process. This is mainly due to the lack of emotionally rich facial
animation data and of algorithms that can synthesize speech animation with
emotional expressions at the same time. In addition, the majority of models are
deterministic, meaning that, given the same audio input, they produce the same
output motion. We argue that emotions and non-determinism are crucial for
generating diverse and emotionally rich facial animations. In this paper, we
propose ProbTalk3D, a non-deterministic neural network approach for
emotion-controllable speech-driven 3D facial animation synthesis using a
two-stage VQ-VAE model and the emotionally rich facial animation dataset
3DMEAD. We provide an extensive comparative analysis of our model against
recent 3D facial animation synthesis approaches, evaluating the results
objectively, qualitatively, and through a perceptual user study. We highlight
several objective metrics that are
more suitable for evaluating stochastic outputs and use both in-the-wild and
ground truth data for subjective evaluation. To our knowledge, this is the
first non-deterministic 3D facial animation synthesis method incorporating a
rich emotion dataset and emotion control with emotion labels and intensity
levels. Our evaluation demonstrates that the proposed model achieves superior
performance compared to state-of-the-art emotion-controlled, deterministic,
and non-deterministic models. We recommend watching the supplementary video to
judge the quality of the results. The entire codebase is publicly available
(https://github.com/uuembodiedsocialai/ProbTalk3D/).
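
For readers unfamiliar with the core building block named in the abstract, the following is a minimal, generic sketch of the vector-quantization step at the heart of a VQ-VAE, written in PyTorch. It is illustrative background only: the class name, codebook size, latent dimensionality, and loss weighting are assumptions, not the authors' ProbTalk3D implementation, which is available in the linked repository.

# Generic VQ-VAE quantization sketch (illustrative; not the ProbTalk3D code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, code_dim: int = 128, beta: float = 0.25):
        super().__init__()
        # Learnable codebook of discrete latent embeddings
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: continuous encoder output, shape (batch, time, code_dim)
        flat = z_e.reshape(-1, z_e.shape[-1])                      # (B*T, D)
        # Squared L2 distance from each latent vector to every codebook entry
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))             # (B*T, K)
        indices = dists.argmin(dim=1)                              # nearest code per frame
        z_q = self.codebook(indices).view_as(z_e)                  # quantized latents

        # Standard VQ-VAE objective: codebook loss + commitment loss
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss

if __name__ == "__main__":
    vq = VectorQuantizer()
    latents = torch.randn(2, 30, 128)        # e.g. 30 animation frames per clip
    quantized, codes, vq_loss = vq(latents)
    print(quantized.shape, codes.shape, vq_loss.item())

In a two-stage setup such as the one described above, a quantizer like this would typically be trained first to learn a discrete motion prior, with a second stage then predicting code indices from audio and emotion conditions; the exact staging in ProbTalk3D should be checked against the paper and repository.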