Lip-Reading via Deep Neural Networks Using Hybrid Visual Features

IF 0.8 | CAS Zone 4, Computer Science | JCR Q4, IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY
Fatemeh Vakhshiteh, F. Almasganj, A. Nickabadi
DOI: 10.5566/IAS.1859
Journal: Image Analysis & Stereology
Published: 2018-07-09 (Journal Article)
Citations: 9

Abstract

Lip-reading is commonly understood as visually interpreting a speaker's lip movements during speech. Experiments over many years have shown that speech intelligibility increases when visual facial information is available, an effect that becomes more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, feature diversity, and inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system remains under investigation. This paper searches for a lip-reading model with an efficient incorporation and arrangement of processing blocks for extracting highly discriminative visual features. The application of a properly structured Deep Belief Network (DBN)-based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, achieving phone recognition rates (PRRs) of 77.65% and 73.40%, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
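A DBN-based recognizer of the kind the abstract describes is conventionally built by stacking Restricted Boltzmann Machines (RBMs) pretrained layer by layer. The following is a minimal illustrative sketch of that building block — a single RBM trained with one-step Contrastive Divergence (CD-1) on toy binary vectors standing in for lip-region visual features. The data, sizes, and hyperparameters are invented for illustration and are not the paper's actual pipeline.

```python
# Illustrative sketch: one RBM layer trained with CD-1, the standard
# building block of a Deep Belief Network. Toy data, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible-unit biases
        self.b_h = np.zeros(n_hidden)    # hidden-unit biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Contrastive Divergence with a single Gibbs step (CD-1):
        # positive phase on the data, negative phase on a one-step sample.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))  # reconstruction error

# Toy binary "visual feature" vectors (64 frames, 20 features each).
data = (rng.random((64, 20)) < 0.3).astype(float)
rbm = RBM(n_visible=20, n_hidden=8)
errors = [rbm.cd1_step(data) for _ in range(200)]
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

In a full DBN the hidden activations of each trained RBM become the visible data for the next layer, and the pretrained stack is then fine-tuned discriminatively (e.g., with a softmax readout over phone classes) before decoding.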
Source journal
Image Analysis & Stereology (MATERIALS SCIENCE, MULTIDISCIPLINARY; MATHEMATICS, APPLIED)
CiteScore: 2.00
Self-citation rate: 0.00%
Articles published: 7
Review time: >12 weeks
Journal description: Image Analysis and Stereology is the official journal of the International Society for Stereology & Image Analysis. It promotes the exchange of scientific, technical, organizational and other information on the quantitative analysis of data having a geometrical structure, including stereology, differential geometry, image analysis, image processing, mathematical morphology, stochastic geometry, statistics, pattern recognition, and related topics. The fields of application are not restricted and range from biomedicine, materials sciences and physics to geology and geography.