Lip-Reading via Deep Neural Networks Using Hybrid Visual Features

IF 0.8 | CAS Zone 4, Computer Science | JCR Q4, IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY
Fatemeh Vakhshiteh, F. Almasganj, A. Nickabadi
DOI: 10.5566/IAS.1859
Journal: Image Analysis & Stereology
Published: 2018-07-09 (Journal Article)
Citations: 9

Abstract

Lip-reading is commonly understood as visually interpreting a speaker's lip movements during speech. Experiments over many years have shown that speech intelligibility increases when visual facial information is available, an effect that becomes more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, feature diversity, and inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system remains under investigation. This paper searches for a lip-reading model with an efficient incorporation and arrangement of processing blocks for extracting highly discriminative visual features. The application of a properly structured Deep Belief Network (DBN)-based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, achieving phone recognition rates (PRRs) of 77.65% and 73.40%, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
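A DBN-based recognizer of the kind the abstract describes is conventionally built by stacking Restricted Boltzmann Machines (RBMs) pretrained layer by layer. The following is a minimal illustrative sketch of that building block — a single RBM trained with one-step Contrastive Divergence (CD-1) on toy binary vectors standing in for lip-region visual features. The data, sizes, and hyperparameters are invented for illustration and are not the paper's actual pipeline.

```python
# Illustrative sketch: one RBM layer trained with CD-1, the standard
# building block of a Deep Belief Network. Toy data, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible-unit biases
        self.b_h = np.zeros(n_hidden)    # hidden-unit biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Contrastive Divergence with a single Gibbs step (CD-1):
        # positive phase on the data, negative phase on a one-step sample.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))  # reconstruction error

# Toy binary "visual feature" vectors (64 frames, 20 features each).
data = (rng.random((64, 20)) < 0.3).astype(float)
rbm = RBM(n_visible=20, n_hidden=8)
errors = [rbm.cd1_step(data) for _ in range(200)]
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

In a full DBN the hidden activations of each trained RBM become the visible data for the next layer, and the pretrained stack is then fine-tuned discriminatively (e.g., with a softmax readout over phone classes) before decoding.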
Source journal
Image Analysis & Stereology (MATERIALS SCIENCE, MULTIDISCIPLINARY; MATHEMATICS, APPLIED)
CiteScore: 2.00
Self-citation rate: 0.00%
Articles published: 7
Review time: >12 weeks
Journal description: Image Analysis and Stereology is the official journal of the International Society for Stereology & Image Analysis. It promotes the exchange of scientific, technical, organizational and other information on the quantitative analysis of data having a geometrical structure, including stereology, differential geometry, image analysis, image processing, mathematical morphology, stochastic geometry, statistics, pattern recognition, and related topics. The fields of application are not restricted and range from biomedicine, materials sciences and physics to geology and geography.