Learning Speech-driven 3D Conversational Gestures from Video

I. Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, H. Seidel, Gerard Pons-Moll, Mohamed A. Elgharib, C. Theobalt

Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents · DOI: 10.1145/3472306.3478335 · Published 2021-02-13 · Cited by 53
Abstract: We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem, since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN)-based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.
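To make the audio-conditioned GAN idea from the abstract concrete, below is a minimal PyTorch sketch. This is not the authors' implementation: the feature sizes (AUDIO_DIM, POSE_DIM, SEQ_LEN), the 1D-CNN layouts, and all layer widths are illustrative assumptions. The key point it demonstrates is that the discriminator scores a pose sequence jointly with the audio it is paired with, so it judges audio-motion plausibility rather than motion plausibility alone.

import torch
import torch.nn as nn

# Hypothetical feature sizes; the paper does not specify these here.
AUDIO_DIM = 64   # per-frame audio features (e.g., MFCC-like)
POSE_DIM = 63    # e.g., 21 body/hand joints x 3 coordinates
SEQ_LEN = 64     # frames per training window

class Generator(nn.Module):
    """Maps a sequence of audio features to a sequence of 3D poses."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM, 128, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, POSE_DIM, kernel_size=5, padding=2),
        )

    def forward(self, audio):    # audio: (B, AUDIO_DIM, T)
        return self.net(audio)   # poses: (B, POSE_DIM, T)

class Discriminator(nn.Module):
    """Scores how plausible a pose sequence is *given* the audio it is
    paired with, by convolving over the concatenated channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM + POSE_DIM, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.LazyLinear(1),    # one real/fake logit per window
        )

    def forward(self, audio, poses):  # both: (B, C, T)
        return self.net(torch.cat([audio, poses], dim=1))

# One adversarial step on a batch of paired (audio, pose) windows.
G, D = Generator(), Discriminator()
bce = nn.BCEWithLogitsLoss()
audio = torch.randn(8, AUDIO_DIM, SEQ_LEN)       # stand-in audio batch
real_poses = torch.randn(8, POSE_DIM, SEQ_LEN)   # stand-in mocap batch

fake_poses = G(audio)
d_loss = bce(D(audio, real_poses), torch.ones(8, 1)) + \
         bce(D(audio, fake_poses.detach()), torch.zeros(8, 1))
g_loss = bce(D(audio, fake_poses), torch.ones(8, 1))  # fool the critic

Conditioning the discriminator on audio is what addresses the multi-modality the abstract describes: since many gestures can plausibly accompany the same speech, a plain regression loss would average them into damped motion, whereas an audio-conditioned adversarial loss only requires the output to be one plausible, synchronized gesture sequence.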