Enhancing quality and accuracy of speech recognition system by using multimodal audio-visual speech signal

Eslam E. El Maghraby, A. Gody, M. Farouk
{"title":"Enhancing quality and accuracy of speech recognition system by using multimodal audio-visual speech signal","authors":"Eslam E. El Maghraby, A. Gody, M. Farouk","doi":"10.1109/ICENCO.2016.7856472","DOIUrl":null,"url":null,"abstract":"Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. This paper aims to build a connected-words audio visual speech recognition system (AV-ASR) for English language that uses both acoustic and visual speech information to improve the recognition performance. Mel frequency cepstral coefficients (MFCCs) have been used to extract the audio features from the speech-files. For the visual counterpart, the Discrete Cosine Transform (DCT) Coefficients have been used to extract the visual feature from the speaker's mouth region and Principle Component Analysis (PCA) have been used for dimensionality reduction purpose, These features are then concatenated with traditional audio ones, and the resulting features are used for training hidden Markov models (HMMs) parameters using word level acoustic models. The system has been developed using hidden Markov model toolkit (HTK) that uses hidden Markov models (HMMs) for recognition. The potential of the suggested approach is demonstrate by a preliminary experiment on the GRID sentence database one of the largest databases available for audio-visual recognition system, which contains continuous English voice commands for a small vocabulary task. The experimental results show that the proposed Audio Video Speech Recognizer (AV-ASR) system exhibits higher recognition rate in comparison to an audio-only recognizer as well as it indicates robust performance. 
An increase of success rate by 3.9% for the grammar based word recognition system overall speakers is achieved for speaker independent test and for speaker dependent, it changes from speaker to another between 7% and 1%. Also when test the system under noisy environment it improve the result.","PeriodicalId":332360,"journal":{"name":"2016 12th International Computer Engineering Conference (ICENCO)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th International Computer Engineering Conference (ICENCO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICENCO.2016.7856472","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can suffer from deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. This paper aims to build a connected-words audio-visual speech recognition system (AV-ASR) for English that uses both acoustic and visual speech information to improve recognition performance. Mel-frequency cepstral coefficients (MFCCs) are used to extract audio features from the speech files. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are used to extract visual features from the speaker's mouth region, and Principal Component Analysis (PCA) is used for dimensionality reduction. These visual features are then concatenated with the traditional audio ones, and the resulting feature vectors are used to train hidden Markov model (HMM) parameters with word-level acoustic models. The system is developed with the Hidden Markov Model Toolkit (HTK), which uses HMMs for recognition. The potential of the suggested approach is demonstrated by a preliminary experiment on the GRID sentence database, one of the largest databases available for audio-visual recognition, which contains continuous English voice commands for a small-vocabulary task. The experimental results show that the proposed AV-ASR system achieves a higher recognition rate than an audio-only recognizer and exhibits robust performance. For the grammar-based word recognition system, the success rate over all speakers increases by 3.9% in the speaker-independent test; in the speaker-dependent test, the improvement varies from speaker to speaker between 1% and 7%. The system also yields improved results when tested in noisy environments.
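The feature pipeline described above (DCT coefficients of the mouth region, PCA reduction, then frame-wise concatenation with audio MFCCs) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the block size, PCA dimension, and toy data are assumptions, and in practice the MFCCs would come from a front end such as HTK's HCopy rather than random arrays.

```python
import numpy as np
from scipy.fftpack import dct

def visual_features(mouth_rois, block=6, pca_dim=10):
    """DCT+PCA visual front end (sketch of the paper's approach).

    For each grayscale mouth ROI, take the 2-D DCT, keep the low-frequency
    `block` x `block` corner, then reduce with PCA. `block` and `pca_dim`
    are illustrative values, not the paper's settings.
    """
    feats = []
    for roi in mouth_rois:
        c = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        feats.append(c[:block, :block].ravel())   # low-frequency coefficients
    X = np.asarray(feats)
    X = X - X.mean(axis=0)                        # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:pca_dim].T                     # project onto top components

def fuse(audio_mfcc, visual_feats):
    """Frame-wise concatenation of audio and visual features (early fusion)."""
    n = min(len(audio_mfcc), len(visual_feats))
    return np.hstack([audio_mfcc[:n], visual_feats[:n]])

# Toy data: 20 frames of 13-dim MFCCs and 20 synthetic 16x16 mouth ROIs.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, 13))
rois = rng.normal(size=(20, 16, 16))
av = fuse(mfcc, visual_features(rois))
print(av.shape)  # (20, 23): 13 audio + 10 visual dims per frame
```

The fused feature matrix would then serve as the observation sequence for training the word-level HMMs (in the paper, via HTK's HERest).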