Automatic Lipreading with Limited Training Data

Shi-Lin Wang, W. Lau, S. Leung
{"title":"有限训练数据的自动唇读","authors":"Shi-Lin Wang, W. Lau, S. Leung","doi":"10.1109/ICPR.2006.301","DOIUrl":null,"url":null,"abstract":"Speech recognition solely based on visual information such as the lip shape and its movement is referred to as lipreading. This paper presents an automatic lipreading technique for speaker dependent (SD) and speaker independent (SI) speech recognition tasks. Since the visual features are derived according to the frame rate of the video sequence, spline representation is then employed to translate the discrete-time sampled visual features into continuous domain. The spline coefficients in the same word class are constrained to have similar expression and can be estimated from the training data by the EM algorithm. In addition, an adaptive multi-model approach is proposed to overcome the variation caused by different speaking style in speaker-independent recognition task. The experiments are carried out to recognize the ten English digits and an accuracy of 96% for speaker dependent recognition and 88% for speaker independent recognition have been achieved, which shows the superiority of our approach compared with other classifiers investigated","PeriodicalId":236033,"journal":{"name":"18th International Conference on Pattern Recognition (ICPR'06)","volume":"597 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"Automatic Lipreading with Limited Training Data\",\"authors\":\"Shi-Lin Wang, W. Lau, S. Leung\",\"doi\":\"10.1109/ICPR.2006.301\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech recognition solely based on visual information such as the lip shape and its movement is referred to as lipreading. This paper presents an automatic lipreading technique for speaker dependent (SD) and speaker independent (SI) speech recognition tasks. Since the visual features are derived according to the frame rate of the video sequence, spline representation is then employed to translate the discrete-time sampled visual features into continuous domain. The spline coefficients in the same word class are constrained to have similar expression and can be estimated from the training data by the EM algorithm. In addition, an adaptive multi-model approach is proposed to overcome the variation caused by different speaking style in speaker-independent recognition task. 
The experiments are carried out to recognize the ten English digits and an accuracy of 96% for speaker dependent recognition and 88% for speaker independent recognition have been achieved, which shows the superiority of our approach compared with other classifiers investigated\",\"PeriodicalId\":236033,\"journal\":{\"name\":\"18th International Conference on Pattern Recognition (ICPR'06)\",\"volume\":\"597 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-08-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"18th International Conference on Pattern Recognition (ICPR'06)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPR.2006.301\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"18th International Conference on Pattern Recognition (ICPR'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR.2006.301","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Speech recognition based solely on visual information, such as the lip shape and its movement, is referred to as lipreading. This paper presents an automatic lipreading technique for speaker-dependent (SD) and speaker-independent (SI) speech recognition tasks. Since the visual features are derived at the frame rate of the video sequence, a spline representation is employed to translate the discrete-time sampled visual features into the continuous domain. The spline coefficients within the same word class are constrained to have a similar expression and can be estimated from the training data by the EM algorithm. In addition, an adaptive multi-model approach is proposed to overcome the variation caused by different speaking styles in the speaker-independent recognition task. Experiments on recognizing the ten English digits achieve an accuracy of 96% for speaker-dependent recognition and 88% for speaker-independent recognition, showing the superiority of the proposed approach over the other classifiers investigated.
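The abstract only outlines the spline idea, so the following is a minimal illustrative sketch (not the authors' implementation) of how frame-rate lip features can be re-expressed in the continuous domain with cubic B-splines and then compared against per-class templates. The function names (`spline_coefficients`, `classify`) and the `mouth_height` feature are assumptions made for illustration, and simple per-class template trajectories stand in for the EM-estimated, class-shared spline coefficients and the adaptive multi-model scheme described in the paper.

```python
# Sketch only: per-frame lip-shape features sampled at the video frame rate are
# fitted with a cubic B-spline over normalized utterance time, then resampled on
# a fixed grid so utterances of different lengths yield fixed-size descriptors.
import numpy as np
from scipy.interpolate import splrep, splev


def spline_coefficients(frame_features, n_samples=20, smoothing=0.0):
    """Fit a cubic spline to one visual-feature trajectory and resample it
    on a fixed normalized-time grid, giving a length-n_samples descriptor."""
    frames = np.asarray(frame_features, dtype=float)
    t = np.linspace(0.0, 1.0, len(frames))      # normalized utterance time
    tck = splrep(t, frames, k=3, s=smoothing)   # cubic B-spline fit
    grid = np.linspace(0.0, 1.0, n_samples)     # common evaluation grid
    return splev(grid, tck)                     # continuous-domain resampling


def classify(utterance, class_templates):
    """Assign an utterance (feature name -> per-frame values) to the word class
    whose template trajectories are closest in the resampled spline space.
    Per-class templates here are a simple stand-in for the paper's
    EM-estimated coefficients and adaptive multi-model scheme."""
    scores = {}
    for label, templates in class_templates.items():
        dist = 0.0
        for name, frames in utterance.items():
            desc = spline_coefficients(frames)
            dist += np.sum((desc - templates[name]) ** 2)
        scores[label] = dist
    return min(scores, key=scores.get)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 25-frame mouth-height trajectory for one digit utterance.
    utterance = {"mouth_height": np.sin(np.linspace(0, np.pi, 25))
                 + 0.05 * rng.standard_normal(25)}
    templates = {
        "one": {"mouth_height": spline_coefficients(np.sin(np.linspace(0, np.pi, 30)))},
        "two": {"mouth_height": spline_coefficients(np.linspace(0, 1, 30))},
    }
    print(classify(utterance, templates))  # expected: "one"
```

Resampling every fitted spline on a common normalized-time grid is what allows utterances of different lengths to be compared through a fixed-size descriptor, which mirrors the role the spline representation plays in the paper.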