Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski
{"title":"基于口部动作捕捉的语音片段分类","authors":"Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski","doi":"10.1109/HSI.2018.8430943","DOIUrl":null,"url":null,"abstract":"Visual features convey important information for automatic speech recognition (ASR), especially in noisy environment. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in the multi-modal approach. For that purpose motion capture markers were placed on speakers' faces to obtain lips tracking data during speaking. Different parameterizations strategies were tested and the accuracy of phonemes recognition in different experiments was analyzed. The obtained results and further challenges related to the bi-modal feature extraction process and decision systems employment are discussed.","PeriodicalId":441117,"journal":{"name":"2018 11th International Conference on Human System Interaction (HSI)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Vocalic Segments Classification Assisted by Mouth Motion Capture\",\"authors\":\"Sebastian Cygert, G. Szwoch, Szymon Zaporowski, A. Czyżewski\",\"doi\":\"10.1109/HSI.2018.8430943\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual features convey important information for automatic speech recognition (ASR), especially in noisy environment. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in the multi-modal approach. For that purpose motion capture markers were placed on speakers' faces to obtain lips tracking data during speaking. Different parameterizations strategies were tested and the accuracy of phonemes recognition in different experiments was analyzed. 
The obtained results and further challenges related to the bi-modal feature extraction process and decision systems employment are discussed.\",\"PeriodicalId\":441117,\"journal\":{\"name\":\"2018 11th International Conference on Human System Interaction (HSI)\",\"volume\":\"70 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 11th International Conference on Human System Interaction (HSI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HSI.2018.8430943\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 11th International Conference on Human System Interaction (HSI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HSI.2018.8430943","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Vocalic Segments Classification Assisted by Mouth Motion Capture
Visual features convey important information for automatic speech recognition (ASR), especially in noisy environments. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in a multi-modal approach. For that purpose, motion capture markers were placed on speakers' faces to obtain lip tracking data during speech. Different parameterization strategies were tested, and phoneme recognition accuracy was analyzed across experiments. The obtained results, as well as further challenges related to the bi-modal feature extraction process and the employment of decision systems, are discussed.
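As a rough illustration of how lip tracking data might be turned into classifier features, the sketch below computes two simple lip-shape descriptors (mouth width and vertical opening) per frame from 2-D marker positions. The four-marker layout (left corner, right corner, upper lip, lower lip) and the feature choice are assumptions for illustration only, not the parameterization actually used in the paper.

```python
import numpy as np

def lip_features(markers):
    """Derive simple lip-shape descriptors from mouth motion-capture markers.

    markers: array of shape (n_frames, 4, 2) holding 2-D positions of four
    hypothetical markers per frame: left corner, right corner, upper lip,
    lower lip. (This marker layout is an assumption, not the paper's setup.)
    Returns an (n_frames, 2) array of [mouth width, mouth opening] features.
    """
    markers = np.asarray(markers, dtype=float)
    left, right, upper, lower = (markers[:, i, :] for i in range(4))
    width = np.linalg.norm(right - left, axis=1)     # horizontal lip spread
    opening = np.linalg.norm(lower - upper, axis=1)  # vertical mouth opening
    return np.column_stack([width, opening])

# Example: two frames, mouth closed then open
frames = [
    [[0, 0], [4, 0], [2, 1], [2, 1]],  # closed: zero opening
    [[0, 0], [4, 0], [2, 2], [2, 0]],  # open: opening of 2 units
]
feats = lip_features(frames)
```

Per-frame features like these could then be fed, alongside acoustic features, to a phoneme classifier in a bi-modal setup.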