Enhancing quality and accuracy of speech recognition system by using multimodal audio-visual speech signal

Eslam E. El Maghraby, A. Gody, M. Farouk
{"title":"Enhancing quality and accuracy of speech recognition system by using multimodal audio-visual speech signal","authors":"Eslam E. El Maghraby, A. Gody, M. Farouk","doi":"10.1109/ICENCO.2016.7856472","DOIUrl":null,"url":null,"abstract":"Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. This paper aims to build a connected-words audio visual speech recognition system (AV-ASR) for English language that uses both acoustic and visual speech information to improve the recognition performance. Mel frequency cepstral coefficients (MFCCs) have been used to extract the audio features from the speech-files. For the visual counterpart, the Discrete Cosine Transform (DCT) Coefficients have been used to extract the visual feature from the speaker's mouth region and Principle Component Analysis (PCA) have been used for dimensionality reduction purpose, These features are then concatenated with traditional audio ones, and the resulting features are used for training hidden Markov models (HMMs) parameters using word level acoustic models. The system has been developed using hidden Markov model toolkit (HTK) that uses hidden Markov models (HMMs) for recognition. The potential of the suggested approach is demonstrate by a preliminary experiment on the GRID sentence database one of the largest databases available for audio-visual recognition system, which contains continuous English voice commands for a small vocabulary task. The experimental results show that the proposed Audio Video Speech Recognizer (AV-ASR) system exhibits higher recognition rate in comparison to an audio-only recognizer as well as it indicates robust performance. 
An increase of success rate by 3.9% for the grammar based word recognition system overall speakers is achieved for speaker independent test and for speaker dependent, it changes from speaker to another between 7% and 1%. Also when test the system under noisy environment it improve the result.","PeriodicalId":332360,"journal":{"name":"2016 12th International Computer Engineering Conference (ICENCO)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th International Computer Engineering Conference (ICENCO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICENCO.2016.7856472","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can suffer from deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. This paper aims to build a connected-words audio-visual speech recognition system (AV-ASR) for English that uses both acoustic and visual speech information to improve recognition performance. Mel-frequency cepstral coefficients (MFCCs) are used to extract audio features from the speech files. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are used to extract visual features from the speaker's mouth region, and Principal Component Analysis (PCA) is used for dimensionality reduction. These visual features are then concatenated with the traditional audio ones, and the resulting feature vectors are used to train hidden Markov model (HMM) parameters with word-level acoustic models. The system is developed with the Hidden Markov Model Toolkit (HTK), which uses HMMs for recognition. The potential of the suggested approach is demonstrated by a preliminary experiment on the GRID sentence database, one of the largest databases available for audio-visual recognition, which contains continuous English voice commands for a small-vocabulary task. The experimental results show that the proposed AV-ASR system achieves a higher recognition rate than an audio-only recognizer and exhibits robust performance. For the grammar-based word recognition system, the success rate over all speakers increases by 3.9% in the speaker-independent test; in the speaker-dependent test, the improvement varies from speaker to speaker between 1% and 7%. The system also yields improved results when tested in noisy environments.
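The feature pipeline described above (DCT coefficients of the mouth region, PCA reduction, then frame-wise concatenation with audio MFCCs) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the block size, PCA dimension, and toy data are assumptions, and in practice the MFCCs would come from a front end such as HTK's HCopy rather than random arrays.

```python
import numpy as np
from scipy.fftpack import dct

def visual_features(mouth_rois, block=6, pca_dim=10):
    """DCT+PCA visual front end (sketch of the paper's approach).

    For each grayscale mouth ROI, take the 2-D DCT, keep the low-frequency
    `block` x `block` corner, then reduce with PCA. `block` and `pca_dim`
    are illustrative values, not the paper's settings.
    """
    feats = []
    for roi in mouth_rois:
        c = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        feats.append(c[:block, :block].ravel())   # low-frequency coefficients
    X = np.asarray(feats)
    X = X - X.mean(axis=0)                        # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:pca_dim].T                     # project onto top components

def fuse(audio_mfcc, visual_feats):
    """Frame-wise concatenation of audio and visual features (early fusion)."""
    n = min(len(audio_mfcc), len(visual_feats))
    return np.hstack([audio_mfcc[:n], visual_feats[:n]])

# Toy data: 20 frames of 13-dim MFCCs and 20 synthetic 16x16 mouth ROIs.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, 13))
rois = rng.normal(size=(20, 16, 16))
av = fuse(mfcc, visual_features(rois))
print(av.shape)  # (20, 23): 13 audio + 10 visual dims per frame
```

The fused feature matrix would then serve as the observation sequence for training the word-level HMMs (in the paper, via HTK's HERest).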