Learning Speech-driven 3D Conversational Gestures from Video

I. Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, H. Seidel, Gerard Pons-Moll, Mohamed A. Elgharib, C. Theobalt
{"title":"从视频中学习语音驱动的3D会话手势","authors":"I. Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, H. Seidel, Gerard Pons-Moll, Mohamed A. Elgharib, C. Theobalt","doi":"10.1145/3472306.3478335","DOIUrl":null,"url":null,"abstract":"We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.","PeriodicalId":148152,"journal":{"name":"Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents","volume":"269 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"53","resultStr":"{\"title\":\"Learning Speech-driven 3D Conversational Gestures from Video\",\"authors\":\"I. Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, H. Seidel, Gerard Pons-Moll, Mohamed A. Elgharib, C. Theobalt\",\"doi\":\"10.1145/3472306.3478335\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. 
Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.\",\"PeriodicalId\":148152,\"journal\":{\"name\":\"Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents\",\"volume\":\"269 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"53\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3472306.3478335\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3472306.3478335","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 53

Abstract

We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.
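For a concrete picture of the adversarial setup the abstract describes, the following is a minimal PyTorch sketch: a 1D-CNN generator maps per-frame audio features to 3D pose sequences, and a discriminator scores (pose sequence, audio) pairs for plausibility. The feature dimensions, layer stack, and training step are illustrative assumptions, not the authors' actual implementation.

# Minimal sketch of an audio-conditioned GAN for gesture synthesis.
# All module names, feature dimensions, and the training step are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

AUDIO_DIM = 64    # assumed per-frame audio feature size (e.g. mel features)
POSE_DIM = 147    # assumed 3D body+hand pose vector per frame
SEQ_LEN = 64      # assumed length of a training window in frames

class Generator(nn.Module):
    """1D CNN mapping an audio feature sequence to a 3D pose sequence."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, POSE_DIM, kernel_size=5, padding=2),
        )

    def forward(self, audio):          # audio: (batch, AUDIO_DIM, SEQ_LEN)
        return self.net(audio)         # poses: (batch, POSE_DIM, SEQ_LEN)

class Discriminator(nn.Module):
    """Scores the plausibility of a pose sequence paired with its audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(AUDIO_DIM + POSE_DIM, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, 1),         # real/fake logit for the (audio, pose) pair
        )

    def forward(self, audio, poses):
        return self.net(torch.cat([audio, poses], dim=1))

# One adversarial training step with a standard GAN loss (assumed setup).
gen, disc = Generator(), Discriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

audio = torch.randn(8, AUDIO_DIM, SEQ_LEN)      # stand-in for real audio features
real_poses = torch.randn(8, POSE_DIM, SEQ_LEN)  # stand-in for captured 3D poses

# Discriminator: real pairs should score 1, generated pairs 0.
fake_poses = gen(audio).detach()
loss_d = bce(disc(audio, real_poses), torch.ones(8, 1)) + \
         bce(disc(audio, fake_poses), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator: produce poses the discriminator scores as plausible for this audio.
loss_g = bce(disc(audio, gen(audio)), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

Conditioning the discriminator on the audio, rather than on the poses alone, is what lets such a model judge audio-motion plausibility instead of motion realism in isolation, which matches the abstract's framing of gesture synthesis as a multi-modal problem.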