基于语音的3D面部动画端到端学习

Proceedings of the 20th ACM International Conference on Multimodal Interaction Pub Date : 2018-10-02 DOI:10.1145/3242969.3243017

Hai Xuan Pham, Yuting Wang, V. Pavlovic

{"title":"基于语音的3D面部动画端到端学习","authors":"Hai Xuan Pham, Yuting Wang, V. Pavlovic","doi":"10.1145/3242969.3243017","DOIUrl":null,"url":null,"abstract":"We present a deep learning framework for real-time speech-driven 3D facial animation from speech audio. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of related facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or both eyebrows raise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors across a wide range of facial actions and emotional states show promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"End-to-end Learning for 3D Facial Animation from Speech\",\"authors\":\"Hai Xuan Pham, Yuting Wang, V. Pavlovic\",\"doi\":\"10.1145/3242969.3243017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a deep learning framework for real-time speech-driven 3D facial animation from speech audio. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of related facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or both eyebrows raise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors across a wide range of facial actions and emotional states show promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.\",\"PeriodicalId\":308751,\"journal\":{\"name\":\"Proceedings of the 20th ACM International Conference on Multimodal Interaction\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 20th ACM International Conference on Multimodal Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3242969.3243017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3242969.3243017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 34

摘要

我们提出了一个深度学习框架，用于实时语音驱动的语音音频3D面部动画。我们的深度神经网络直接将语音谱图的输入序列映射到一系列微面部动作单元强度，以驱动3D混合型面部模型。特别是，我们的深度模型能够学习语音中随时间变化的上下文信息和情感状态的潜在表征。因此，我们的模型不仅在推理时激活适当的面部动作单元，以嘴唇运动的形式描绘不同的话语生成动作，而且在没有任何假设的情况下，通过调整相关面部单元激活的强度，自动估计说话者的情绪强度，并再现她不断变化的情感状态。例如，在愉快的演讲中，嘴张得比平时大，而其他面部单位则放松;或者眉毛扬得更高，表现出惊讶的状态。在不同演员的各种视听语料库上进行的各种面部动作和情绪状态的实验显示了我们的方法有希望的结果。由于与说话人无关，我们的广义模型可以很容易地应用于人机交互和动画中的各种任务。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

End-to-end Learning for 3D Facial Animation from Speech

We present a deep learning framework for real-time speech-driven 3D facial animation from speech audio. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of related facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or both eyebrows raise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors across a wide range of facial actions and emotional states show promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 20th ACM International Conference on Multimodal Interaction

自引率

0.00%

发文量