由组合语音特征驱动的三维网格序列视觉语音合成

2017 IEEE International Conference on Multimedia and Expo (ICME) Pub Date : 2017-07-01 DOI:10.1109/ICME.2017.8019546

Felix Kuhnke, J. Ostermann

{"title":"由组合语音特征驱动的三维网格序列视觉语音合成","authors":"Felix Kuhnke, J. Ostermann","doi":"10.1109/ICME.2017.8019546","DOIUrl":null,"url":null,"abstract":"Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Visual speech synthesis from 3D mesh sequences driven by combined speech features\",\"authors\":\"Felix Kuhnke, J. Ostermann\",\"doi\":\"10.1109/ICME.2017.8019546\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.\",\"PeriodicalId\":330977,\"journal\":{\"name\":\"2017 IEEE International Conference on Multimedia and Expo (ICME)\",\"volume\":\"102 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE International Conference on Multimedia and Expo (ICME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME.2017.8019546\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME.2017.8019546","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

给定预注册的3D网格序列和伴随的音素标记音频，我们的系统创建了一个可动画的面部模型和一个映射程序，为任意语音输入生成逼真的语音动画。语音特征到模型参数的映射使用随机森林进行回归。我们提出了一种基于音位标签和声学特征的语音特征。该新特征产生了更具表现力的面部动画，并能鲁棒地处理时间标记错误。此外，通过采用滑动窗口方法进行特征提取，系统易于训练并允许低延迟合成。我们表明，我们的新语音特征组合提高了视觉语音合成。我们的发现得到了主观用户研究的证实。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Visual speech synthesis from 3D mesh sequences driven by combined speech features

Given a pre-registered 3D mesh sequence and accompanying phoneme-labeled audio, our system creates an animatable face model and a mapping procedure to produce realistic speech animations for arbitrary speech input. Mapping of speech features to model parameters is done using random forests for regression. We propose a new speech feature based on phonemic labels and acoustic features. The novel feature produces more expressive facial animation and it robustly handles temporal labeling errors. Furthermore, by employing a sliding window approach to feature extraction, the system is easy to train and allows for low-delay synthesis. We show that our novel combination of speech features improves visual speech synthesis. Our findings are confirmed by a subjective user study.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 IEEE International Conference on Multimedia and Expo (ICME)

自引率

0.00%

发文量