{"title":"Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition","authors":"Satoshi Nakamura, K. Kumatani, S. Tamura","doi":"10.1109/ICMI.2002.1167011","DOIUrl":null,"url":null,"abstract":"The demand for audio-visual speech recognition (AVSR) has increased in order to make speech recognition systems robust to acoustic noise. There are two kinds of research issue in audio-visual speech recognition, such as integration modeling considering asynchronicity between modalities and adaptive information weighting according information reliability. This paper proposes a method to effectively integrate audio and visual information. Such integration, inevitably, necessitates modeling the synchronization and asynchronization of audio and visual information. To address the time lag and correlation problems in individual features between speech and lip movements, we introduce a type of integrated HMM modeling of audio-visual information based on a family of a product HMM. The proposed model can represent state synchronicity not only within a phoneme, but also between phonemes. Furthermore, we also propose a rapid stream weight optimization based on the GPD algorithm for noisy, bimodal speech recognition. Evaluation experiments show that the proposed method improves the recognition accuracy for noisy speech. When SNR=0 dB our proposed method attained 16% higher performance compared to a product HMM without synchronicity re-estimation.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"293 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMI.2002.1167011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 14
Abstract
The demand for audio-visual speech recognition (AVSR) has increased in order to make speech recognition systems robust to acoustic noise. There are two main research issues in AVSR: integration modeling that accounts for asynchronicity between the modalities, and adaptive information weighting according to the reliability of each stream. This paper proposes a method to effectively integrate audio and visual information. Such integration inevitably requires modeling both the synchrony and the asynchrony of the two streams. To address the time lag and the correlation between the individual features of speech and lip movements, we introduce an integrated HMM for audio-visual information based on a family of product HMMs. The proposed model can represent state synchronicity not only within a phoneme but also between phonemes. Furthermore, we propose a rapid stream weight optimization based on the generalized probabilistic descent (GPD) algorithm for noisy bimodal speech recognition. Evaluation experiments show that the proposed method improves recognition accuracy for noisy speech: at an SNR of 0 dB, it attains 16% higher performance than a product HMM without synchronicity re-estimation.
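To make the integration idea concrete, below is a minimal sketch of a product-HMM construction: each composite state pairs one audio-HMM state with one visual-HMM state, so the two streams may occupy different states in the same frame, which is what lets the model absorb audio-visual asynchrony. The function names, the independent-stream transition assumption, and the two-weight emission score are illustrative only; the paper additionally re-estimates state synchronicity within and between phonemes, which this sketch does not cover.

```python
import numpy as np

def product_states(n_audio, n_visual):
    """Enumerate composite states (i, j) of an audio HMM with n_audio
    states and a visual HMM with n_visual states."""
    return [(i, j) for i in range(n_audio) for j in range(n_visual)]

def product_transitions(A_audio, A_visual):
    """Composite transition matrix as the Kronecker product of the
    per-stream matrices, i.e. the streams transition independently.
    Row/column ordering matches product_states: index = i * n_visual + j."""
    return np.kron(A_audio, A_visual)

def stream_weighted_log_emission(log_b_audio, log_b_visual, lam):
    """Log emission score of a composite state: per-stream log likelihoods
    combined with stream weights lam and (1 - lam). lam near 1 trusts the
    audio stream; lam near 0 trusts the visual stream."""
    return lam * log_b_audio + (1.0 - lam) * log_b_visual

# Example: two 3-state left-to-right phoneme HMMs yield 9 composite states.
A = np.array([[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
states = product_states(3, 3)          # [(0,0), (0,1), ..., (2,2)]
A_prod = product_transitions(A, A)     # 9 x 9 composite transition matrix
```

With 3-state audio and visual phoneme models this gives 9 composite states; constraining which (i, j) pairs are reachable is one way to control how much asynchrony the model tolerates.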
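Likewise, here is a hedged sketch of a GPD-style stream-weight update, assuming a sigmoid loss over a two-class misclassification measure (best rival score minus correct-hypothesis score). The names, the loss shape, and the hyperparameters are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpd_weight_step(score_correct, score_rival, lam, eta=0.01, alpha=1.0):
    """One GPD-style gradient step on the audio stream weight lam.

    score_correct and score_rival are (audio, visual) pairs: the per-stream
    log-likelihood contributions of the correct hypothesis and its best
    rival. The misclassification measure d = g_rival - g_correct is passed
    through a sigmoid loss, and lam descends the loss gradient."""
    g_correct = lam * score_correct[0] + (1 - lam) * score_correct[1]
    g_rival = lam * score_rival[0] + (1 - lam) * score_rival[1]
    d = g_rival - g_correct                       # misclassification measure
    s = sigmoid(alpha * d)
    loss_grad = alpha * s * (1.0 - s)             # d(loss)/d(d)
    # d(d)/d(lam): since g = lam*a + (1-lam)*v, dg/dlam = a - v
    dd_dlam = (score_rival[0] - score_rival[1]) \
            - (score_correct[0] - score_correct[1])
    lam_new = lam - eta * loss_grad * dd_dlam
    return float(np.clip(lam_new, 0.0, 1.0))      # keep weight in [0, 1]
```

Iterating this update over utterances adapts the audio/visual balance to the current noise condition, which is the role the abstract assigns to the rapid stream-weight optimization.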