Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski
{"title":"基于口部动作捕捉的语音片段分类","authors":"Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski","doi":"10.1109/HSI.2018.8430943","DOIUrl":null,"url":null,"abstract":"Visual features convey important information for automatic speech recognition (ASR), especially in noisy environment. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in the multi-modal approach. For that purpose motion capture markers were placed on speakers' faces to obtain lips tracking data during speaking. Different parameterizations strategies were tested and the accuracy of phonemes recognition in different experiments was analyzed. The obtained results and further challenges related to the bi-modal feature extraction process and decision systems employment are discussed.","PeriodicalId":441117,"journal":{"name":"2018 11th International Conference on Human System Interaction (HSI)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Vocalic Segments Classification Assisted by Mouth Motion Capture\",\"authors\":\"Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski\",\"doi\":\"10.1109/HSI.2018.8430943\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual features convey important information for automatic speech recognition (ASR), especially in noisy environment. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in the multi-modal approach. For that purpose motion capture markers were placed on speakers' faces to obtain lips tracking data during speaking. Different parameterizations strategies were tested and the accuracy of phonemes recognition in different experiments was analyzed. 
The obtained results and further challenges related to the bi-modal feature extraction process and decision systems employment are discussed.\",\"PeriodicalId\":441117,\"journal\":{\"name\":\"2018 11th International Conference on Human System Interaction (HSI)\",\"volume\":\"70 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 11th International Conference on Human System Interaction (HSI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HSI.2018.8430943\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 11th International Conference on Human System Interaction (HSI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HSI.2018.8430943","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Vocalic Segments Classification Assisted by Mouth Motion Capture
Visual features convey important information for automatic speech recognition (ASR), especially in noisy environments. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in a multi-modal approach. For that purpose, motion capture markers were placed on speakers' faces to obtain lip tracking data during speech. Different parameterization strategies were tested, and phoneme recognition accuracy was analyzed across experiments. The obtained results, as well as further challenges related to the bi-modal feature extraction process and the employment of decision systems, are discussed.
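As a rough illustration of how lip tracking data might be turned into classifier features, the sketch below computes two simple lip-shape descriptors (mouth width and vertical opening) per frame from 2-D marker positions. The four-marker layout (left corner, right corner, upper lip, lower lip) and the feature choice are assumptions for illustration only, not the parameterization actually used in the paper.

```python
import numpy as np

def lip_features(markers):
    """Derive simple lip-shape descriptors from mouth motion-capture markers.

    markers: array of shape (n_frames, 4, 2) holding 2-D positions of four
    hypothetical markers per frame: left corner, right corner, upper lip,
    lower lip. (This marker layout is an assumption, not the paper's setup.)
    Returns an (n_frames, 2) array of [mouth width, mouth opening] features.
    """
    markers = np.asarray(markers, dtype=float)
    left, right, upper, lower = (markers[:, i, :] for i in range(4))
    width = np.linalg.norm(right - left, axis=1)     # horizontal lip spread
    opening = np.linalg.norm(lower - upper, axis=1)  # vertical mouth opening
    return np.column_stack([width, opening])

# Example: two frames, mouth closed then open
frames = [
    [[0, 0], [4, 0], [2, 1], [2, 1]],  # closed: zero opening
    [[0, 0], [4, 0], [2, 2], [2, 0]],  # open: opening of 2 units
]
feats = lip_features(frames)
```

Per-frame features like these could then be fed, alongside acoustic features, to a phoneme classifier in a bi-modal setup.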