Improving Viseme Recognition with GAN-based Multi-view Mapping
Dario Augusto Borges Oliveira, Andréa Britto Mattos, E. Morais
IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2019), pp. 1-8, May 2019
DOI: 10.1109/FG.2019.8756589
Abstract
Current speech recognition technologies in the visual domain can only identify words and sentences in still images. Identifying visemes (i.e., the smallest visual units of spoken text) is useful when no language models or dictionaries are available, which is often the case for languages other than English; it is challenging, however, because no temporal information can be extracted from a single image. In parallel, previous work has demonstrated that exploiting data acquired simultaneously from multiple views can improve recognition accuracy compared to single-view data. In practice, however, most available audio-visual datasets are recorded from a single view, essentially due to acquisition limitations. In this work, we address viseme recognition in still images and explore the synthetic generation of additional views to improve overall accuracy. To that end, we use Generative Adversarial Networks (GANs) trained with synthetic data to map mouth images acquired from a single arbitrary view to frontal and side views, in which the face is rotated about the vertical axis by approximately 30°, 45°, and 60°. We then use a state-of-the-art Convolutional Neural Network to classify the visemes and compare its performance when trained only on the original single-view images versus trained with the additional views artificially generated by the GANs. We run experiments on three audio-visual corpora acquired under different conditions (the GRID, AVICAR, and OuluVS2 datasets), and our results indicate that the additional views synthesized by the GANs improve viseme recognition accuracy in all tested scenarios.
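The paper itself includes no code, but the view-mapping step it describes can be illustrated with a minimal sketch. The sketch below assumes a pix2pix-style conditional GAN in PyTorch: a small encoder-decoder generator translates a 64×64 mouth crop from an arbitrary source view into one target view (e.g., frontal), and a PatchGAN-style discriminator judges (source, target) pairs. All class names, layer sizes, and the loss weighting are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch of a pix2pix-style view-mapping GAN (not the authors' code).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Small encoder-decoder mapping a 64x64 source-view crop to a target view."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),                          # 64 -> 32
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # 32 -> 16
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),                         # 32 -> 64
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """PatchGAN-style critic over concatenated (source, candidate-target) pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, 1, 1),  # per-patch real/fake logits
        )

    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

def train_step(G, D, opt_g, opt_d, src, tgt, l1_weight=100.0):
    """One adversarial + L1 reconstruction update, following the pix2pix recipe."""
    bce = nn.BCEWithLogitsLoss()
    # Discriminator update: real pairs labeled 1, generated pairs labeled 0.
    opt_d.zero_grad()
    fake = G(src).detach()
    d_real, d_fake = D(src, tgt), D(src, fake)
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()
    # Generator update: fool the critic and stay close to the ground-truth view.
    opt_g.zero_grad()
    fake = G(src)
    d_fake = D(src, fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + l1_weight * nn.functional.l1_loss(fake, tgt)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

One such generator would be trained per target view (frontal, 30°, 45°, 60°), using paired synthetic renderings as supervision, consistent with the abstract's description of GANs trained with synthetic data.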
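The classification comparison can be sketched similarly: train a CNN on the original single-view crops alone, then again with the GAN-synthesized views added as labeled augmentations. In the hedged sketch below, ResNet-18 merely stands in for the paper's unnamed "state-of-the-art CNN", and `NUM_VISEMES`, `synthesize_views`, and the per-view generator list are illustrative assumptions.

```python
# Hypothetical sketch of classifier training with GAN-augmented views (not the authors' code).
import torch
import torch.nn as nn
from torchvision import models

NUM_VISEMES = 14  # assumed size of the viseme inventory

def build_classifier():
    """ResNet-18 as a stand-in for the paper's viseme CNN."""
    net = models.resnet18(weights=None)
    net.fc = nn.Linear(net.fc.in_features, NUM_VISEMES)
    return net

@torch.no_grad()
def synthesize_views(generators, crops):
    """Map each single-view crop to every target view (frontal, 30/45/60 deg)."""
    return [G(crops) for G in generators]

def train_epoch(classifier, generators, loader, optimizer, device="cpu"):
    ce = nn.CrossEntropyLoss()
    classifier.train()
    for crops, labels in loader:
        crops, labels = crops.to(device), labels.to(device)
        # The original view and all synthetic views share the same viseme label.
        views = [crops] + synthesize_views(generators, crops)
        inputs = torch.cat(views, dim=0)
        targets = labels.repeat(len(views))
        optimizer.zero_grad()
        loss = ce(classifier(inputs), targets)
        loss.backward()
        optimizer.step()
```

Running `train_epoch` with an empty `generators` list reproduces the single-view baseline, so the same loop supports both sides of the comparison reported in the abstract.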