使用主动外观模型表达视觉文本到语音

2013 IEEE Conference on Computer Vision and Pattern Recognition Pub Date : 2013-06-23 DOI:10.1109/CVPR.2013.434

Robert Anderson, B. Stenger, V. Wan, R. Cipolla

{"title":"使用主动外观模型表达视觉文本到语音","authors":"Robert Anderson, B. Stenger, V. Wan, R. Cipolla","doi":"10.1109/CVPR.2013.434","DOIUrl":null,"url":null,"abstract":"This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a 'talking head', given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.","PeriodicalId":6343,"journal":{"name":"2013 IEEE Conference on Computer Vision and Pattern Recognition","volume":"523 1","pages":"3382-3389"},"PeriodicalIF":0.0000,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"84","resultStr":"{\"title\":\"Expressive Visual Text-to-Speech Using Active Appearance Models\",\"authors\":\"Robert Anderson, B. Stenger, V. Wan, R. Cipolla\",\"doi\":\"10.1109/CVPR.2013.434\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a 'talking head', given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.\",\"PeriodicalId\":6343,\"journal\":{\"name\":\"2013 IEEE Conference on Computer Vision and Pattern Recognition\",\"volume\":\"523 1\",\"pages\":\"3382-3389\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"84\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE Conference on Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPR.2013.434\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE Conference on Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2013.434","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 84

摘要

本文提出了一个完整的视觉文本到语音(VTTS)系统，该系统能够在给定输入文本和一组连续表达权重的情况下，以“说话的头”的形式产生富有表现力的输出。采用主动外观模型(AAM)对人脸进行建模，并对其进行了扩展，使其更适用于VTTS任务。该模型允许对姿态和眨眼状态进行归一化，从而显著减少合成序列中的伪影。我们展示了在超过一百万帧的重建误差方面的定量改进，以及在大规模用户研究中，比较不同系统的输出。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Expressive Visual Text-to-Speech Using Active Appearance Models

This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a 'talking head', given an input text and a set of continuous expression weights. The face is modeled using an active appearance model (AAM), and several extensions are proposed which make it more applicable to the task of VTTS. The model allows for normalization with respect to both pose and blink state which significantly reduces artifacts in the resulting synthesized sequences. We demonstrate quantitative improvements in terms of reconstruction error over a million frames, as well as in large-scale user studies, comparing the output of different systems.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2013 IEEE Conference on Computer Vision and Pattern Recognition

自引率

0.00%

发文量