Improving Viseme Recognition Using GAN-Based Frontal View Mapping

Dario Augusto Borges Oliveira, Andréa Britto Mattos, E. Morais
{"title":"基于gan的正面视图映射改进Viseme识别","authors":"Dario Augusto Borges Oliveira, Andréa Britto Mattos, E. Morais","doi":"10.1109/CVPRW.2018.00289","DOIUrl":null,"url":null,"abstract":"Deep learning methods have become the standard for Visual Speech Recognition problems due to their high accuracy results reported in the literature. However, while successful works have been reported for words and sentences, recognizing shorter segments of speech, like phones, has proven to be much more challenging due to the lack of temporal and contextual information. Also, head-pose variation remains a known issue for facial analysis with direct impact in this problem. In this context, we propose a novel methodology to tackle the problem of recognizing visemes – the visual equivalent of phonemes – using a GAN to artificially lock the face view into a perfect frontal view, reducing the view angle variability and simplifying the recognition task performed by our classification CNN. The GAN is trained using a large-scale synthetic 2D dataset based on realistic 3D facial models, automatically labelled for different visemes, mapping a slightly random view to a perfect frontal view. We evaluate our method using the GRID corpus, which was processed to extract viseme images and their corresponding synthetic frontal views to be further classified by our CNN model. Our results demonstrate that the additional synthetic frontal view is able to improve accuracy in 5.9% when compared with classification using the original image only.","PeriodicalId":150600,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Improving Viseme Recognition Using GAN-Based Frontal View Mapping\",\"authors\":\"Dario Augusto Borges Oliveira, Andréa Britto Mattos, E. Morais\",\"doi\":\"10.1109/CVPRW.2018.00289\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning methods have become the standard for Visual Speech Recognition problems due to their high accuracy results reported in the literature. However, while successful works have been reported for words and sentences, recognizing shorter segments of speech, like phones, has proven to be much more challenging due to the lack of temporal and contextual information. Also, head-pose variation remains a known issue for facial analysis with direct impact in this problem. In this context, we propose a novel methodology to tackle the problem of recognizing visemes – the visual equivalent of phonemes – using a GAN to artificially lock the face view into a perfect frontal view, reducing the view angle variability and simplifying the recognition task performed by our classification CNN. The GAN is trained using a large-scale synthetic 2D dataset based on realistic 3D facial models, automatically labelled for different visemes, mapping a slightly random view to a perfect frontal view. We evaluate our method using the GRID corpus, which was processed to extract viseme images and their corresponding synthetic frontal views to be further classified by our CNN model. 
Our results demonstrate that the additional synthetic frontal view is able to improve accuracy in 5.9% when compared with classification using the original image only.\",\"PeriodicalId\":150600,\"journal\":{\"name\":\"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPRW.2018.00289\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPRW.2018.00289","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Deep learning methods have become the standard for Visual Speech Recognition problems due to their high accuracy results reported in the literature. However, while successful works have been reported for words and sentences, recognizing shorter segments of speech, like phones, has proven to be much more challenging due to the lack of temporal and contextual information. Also, head-pose variation remains a known issue for facial analysis, with a direct impact on this problem. In this context, we propose a novel methodology to tackle the problem of recognizing visemes – the visual equivalent of phonemes – using a GAN to artificially lock the face view into a perfect frontal view, reducing the view angle variability and simplifying the recognition task performed by our classification CNN. The GAN is trained using a large-scale synthetic 2D dataset based on realistic 3D facial models, automatically labelled for different visemes, mapping a slightly random view to a perfect frontal view. We evaluate our method using the GRID corpus, which was processed to extract viseme images and their corresponding synthetic frontal views to be further classified by our CNN model. Our results demonstrate that the additional synthetic frontal view is able to improve accuracy by 5.9% when compared with classification using the original image only.
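The abstract describes a two-stage pipeline: a GAN generator maps a slightly off-frontal face image to a synthetic frontal view, and a classification CNN then predicts the viseme using the original image together with the synthetic view. The sketch below illustrates this data flow in PyTorch; the layer configuration, 64x64 input size, six-channel stacking of the two views, and the 12-class output are illustrative assumptions and do not reproduce the authors' architecture.

```python
# Minimal sketch of the GAN-frontalization + CNN-classification pipeline.
# All architectural details (layer sizes, 64x64 crops, 12 viseme classes)
# are assumptions for illustration, not values taken from the paper.
import torch
import torch.nn as nn

class FrontalizerG(nn.Module):
    """Toy encoder-decoder generator: off-frontal face crop -> synthetic frontal view."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decode(self.encode(x))

class VisemeCNN(nn.Module):
    """Classifier fed with the original crop stacked with its synthetic frontal view."""
    def __init__(self, num_visemes=12):  # class count is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
            nn.Linear(256, num_visemes),
        )

    def forward(self, original, frontal):
        x = torch.cat([original, frontal], dim=1)  # stack views as a 6-channel input
        return self.head(self.features(x))

# Inference: frontalize first, then classify using both views together.
generator, classifier = FrontalizerG().eval(), VisemeCNN().eval()
crop = torch.randn(1, 3, 64, 64)           # stand-in for an extracted viseme image
with torch.no_grad():
    frontal = generator(crop)              # synthetic frontal view from the GAN
    logits = classifier(crop, frontal)     # viseme scores
print(logits.argmax(dim=1))
```

In this sketch the two views are simply concatenated channel-wise before classification; the paper's reported gain (5.9% over using the original image only) motivates feeding the classifier the synthetic frontal view in addition to, not instead of, the original crop.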