Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski
{"title":"Vocalic Segments Classification Assisted by Mouth Motion Capture","authors":"Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski","doi":"10.1109/HSI.2018.8430943","DOIUrl":null,"url":null,"abstract":"Visual features convey important information for automatic speech recognition (ASR), especially in noisy environment. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in the multi-modal approach. For that purpose motion capture markers were placed on speakers' faces to obtain lips tracking data during speaking. Different parameterizations strategies were tested and the accuracy of phonemes recognition in different experiments was analyzed. The obtained results and further challenges related to the bi-modal feature extraction process and decision systems employment are discussed.","PeriodicalId":441117,"journal":{"name":"2018 11th International Conference on Human System Interaction (HSI)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 11th International Conference on Human System Interaction (HSI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HSI.2018.8430943","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Visual features convey important information for automatic speech recognition (ASR), especially in noisy environments. The purpose of this study is to evaluate to what extent visual data (i.e., lip reading) can enhance recognition accuracy in a multi-modal approach. For that purpose, motion capture markers were placed on speakers' faces to obtain lip tracking data during speech. Different parameterization strategies were tested, and phoneme recognition accuracy in the different experiments was analyzed. The obtained results and further challenges related to the bi-modal feature extraction process and the employment of decision systems are discussed.
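As a rough illustration of the bi-modal pipeline the abstract describes, audio and lip-motion features can be combined by early (feature-level) fusion, i.e., concatenating synchronized per-frame vectors before classification. This is a minimal sketch under assumed conventions: the function name, the feature dimensions (13 audio coefficients, 8 mouth markers with 2D coordinates), and the fusion strategy itself are illustrative assumptions, not the paper's actual parameterization.

```python
def fuse_features(audio_frames, lip_frames):
    # Early (feature-level) fusion: concatenate each audio frame's
    # coefficients with the time-aligned lip-marker coordinates.
    # Both inputs are lists of per-frame feature vectors and must
    # have the same number of frames.
    assert len(audio_frames) == len(lip_frames), "frame counts must match"
    return [a + l for a, l in zip(audio_frames, lip_frames)]

# Hypothetical example: 4 frames, 13 audio coefficients per frame,
# and 8 mouth markers with (x, y) positions -> 16 visual values per frame.
audio = [[0.1] * 13 for _ in range(4)]
lips = [[0.0] * 16 for _ in range(4)]
fused = fuse_features(audio, lips)
print(len(fused), len(fused[0]))  # 4 frames, 29 features each
```

Late fusion (combining per-modality classifier decisions instead of features) is the usual alternative; which works better depends on how well the modalities are synchronized.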