Emotional 3D speech visualization from 2D audio visual data

Impact Factor: 0.9 | JCR: Q3 | Category: COMPUTER SCIENCE, THEORY & METHODS

International Journal of Modeling Simulation and Scientific Computing Pub Date : 2022-11-26 DOI:10.1142/s1793962324500028

Luis Guillermo, Jose-Maria Rojas, W. Ugarte


Citations: 0

Abstract

Visual speech is hard to recreate by hand because animation is a time-consuming task: precision and detail must both be considered and must match the expectations of the developers and, above all, those of the audience. To address this problem, several approaches have been designed to accelerate the animation of characters' faces, such as procedural animation and speech-lip synchronization; these methods are most commonly researched in Computer Vision and Machine Learning. In general, however, these tools can suffer from any of the following problems: difficulty adapting to another language, subject, or animation software; high hardware requirements; or results that are perceived as robotic. Our work presents a Deep Learning model for automatic expressive facial animation from audio. We extract generic audio features from expressive, phoneme-rich speech recordings for speech processing that is not focused on a particular language and for emotion recognition. From the videos used for training, we extract landmarks for frame-speech targeting and have the model learn the animation for phoneme pronunciation. We evaluated four variants of our model (two loss functions, each with and without emotion conditioning) through a user perception survey. The variant using a Reconstruction Loss Function with emotion conditioning during training produced more natural results and better synchronization scores, with the approval of the majority of interviewees. For perceived naturalness, it obtained 38.89% of the total approval votes, and for language synchronization it obtained the highest average score, 65.55% (98.33 of 150 total points), across English, German, and Korean.
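The abstract names a Reconstruction Loss Function over facial landmarks but does not define it. As a minimal sketch only, assuming the loss is a mean squared error between predicted and ground-truth landmark coordinates (a common choice, not confirmed by the source), it could look like the following. The names `reconstruction_loss`, `Landmark`, and `Frame` are hypothetical and introduced here for illustration.

```python
from typing import Sequence

# Hypothetical types for illustration; the paper does not specify a data layout.
Landmark = Sequence[float]   # one point, e.g. (x, y) or (x, y, z)
Frame = Sequence[Landmark]   # all landmarks of one video frame


def reconstruction_loss(predicted: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean squared error over every coordinate of every landmark in every frame.

    Assumption: the reconstruction loss is a plain MSE; the source does not say.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("frames must contain the same number of landmarks")
        for pred_point, true_point in zip(pred_frame, true_frame):
            if len(pred_point) != len(true_point):
                raise ValueError("landmarks must have the same dimensionality")
            for p, t in zip(pred_point, true_point):
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("no coordinates to compare")
    return total / count


# Toy data: one frame with two 2D landmarks; one coordinate is off by 0.5.
target = [[(0.0, 0.0), (1.0, 1.0)]]
predicted = [[(0.0, 0.5), (1.0, 1.0)]]
print(reconstruction_loss(predicted, target))  # 0.0625
```

In the toy call, a single coordinate error of 0.5 contributes 0.25 to the squared-error sum, which is averaged over four coordinates. An emotion-conditioned variant would change the model's inputs rather than this loss, so it is not shown here.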


Source journal

International Journal of Modeling Simulation and Scientific Computing (COMPUTER SCIENCE, THEORY & METHODS)

CiteScore

2.50

Self-citation rate

16.70%

Number of publications