{"title":"Improving acoustic modeling using audio-visual speech","authors":"A. H. Abdelaziz","doi":"10.1109/ICME.2017.8019294","DOIUrl":null,"url":null,"abstract":"Reliable visual features that encode the articulator movements of speakers can dramatically improve the decoding accuracy of automatic speech recognition systems when combined with the corresponding acoustic signals. In this paper, a novel framework is proposed to utilize audio-visual speech not only during decoding but also for training better acoustic models. In this framework, a multi-stream hidden Markov model is iteratively deployed to fuse audio and video likelihoods. The fused likelihoods are used to estimate enhanced frame-state alignments, which are finally used as better training targets. The proposed framework is so flexible that it can be partially used to train acoustic models with the available audio-visual data while a conventional training strategy can be followed with the remaining acoustic data. The experimental results show that the acoustic models trained using the proposed audio-visual framework perform significantly better than those trained conventionally with solely acoustic data in clean and noisy conditions.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME.2017.8019294","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Reliable visual features that encode the articulator movements of speakers can dramatically improve the decoding accuracy of automatic speech recognition systems when combined with the corresponding acoustic signals. In this paper, a novel framework is proposed that utilizes audio-visual speech not only during decoding but also for training better acoustic models. In this framework, a multi-stream hidden Markov model is iteratively deployed to fuse audio and video likelihoods. The fused likelihoods are used to estimate enhanced frame-state alignments, which then serve as better training targets. The proposed framework is flexible: it can be applied to the portion of the training data for which audio-visual recordings are available, while a conventional training strategy is followed for the remaining audio-only data. The experimental results show that acoustic models trained with the proposed audio-visual framework perform significantly better, in both clean and noisy conditions, than models trained conventionally on acoustic data alone.
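To make the fusion step concrete, the sketch below shows one common way a multi-stream HMM combines per-frame, per-state audio and video log-likelihoods with a stream weight and then derives a frame-state alignment via Viterbi decoding. This is a minimal illustration under assumed interfaces (the function names, the fixed `audio_weight` parameter, and the plain NumPy Viterbi are ours), not the paper's exact iterative formulation.

```python
import numpy as np

def fuse_log_likelihoods(audio_loglik, video_loglik, audio_weight=0.7):
    """Weighted multi-stream fusion of per-frame, per-state log-likelihoods.

    audio_loglik, video_loglik: arrays of shape (num_frames, num_states)
    audio_weight: stream weight in [0, 1]; (1 - audio_weight) goes to video.
    """
    return audio_weight * audio_loglik + (1.0 - audio_weight) * video_loglik

def viterbi_alignment(fused_loglik, log_trans, log_init):
    """Viterbi decode over the fused scores to obtain a frame-state alignment.

    fused_loglik: (T, S) fused emission log-likelihoods
    log_trans:    (S, S) log transition probabilities
    log_init:     (S,)   log initial-state probabilities
    Returns the most likely state index per frame, usable as training targets.
    """
    T, S = fused_loglik.shape
    delta = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_init + fused_loglik[0]
    for t in range(1, T):
        # scores[i, j] = best score of ending in state i at t-1 and moving to j
        scores = delta[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + fused_loglik[t]
    # Backtrace to recover the per-frame state sequence.
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        states[t] = backptr[t + 1, states[t + 1]]
    return states
```

In the framework described above, alignments of this kind would be re-estimated iteratively and used as improved targets when training the acoustic model; utterances without video would simply keep their conventionally estimated alignments.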