J. Peymanfard, M. R. Mohammadi, Hossein Zeinali, N. Mozayani
{"title":"Lip reading using external viseme decoding","authors":"J. Peymanfard, M. R. Mohammadi, Hossein Zeinali, N. Mozayani","doi":"10.1109/MVIP53647.2022.9738749","DOIUrl":null,"url":null,"abstract":"Lip-reading is the operation of recognizing speech from lip movements. This is a difficult task because the movements of the lips when pronouncing the words are similar for some of them. Viseme is used to describe lip movements during a conversation. This paper aims to show how to use external text data (for viseme-to-character mapping) by dividing video-to-character into two stages, namely converting video to viseme and then converting viseme to character by using separate models. Our proposed method improves word error rate by an absolute rate of 4% compared to the typical sequence to sequence lipreading model on the BBC-Oxford Lip Reading dataset (LRS2).","PeriodicalId":184716,"journal":{"name":"2022 International Conference on Machine Vision and Image Processing (MVIP)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Machine Vision and Image Processing (MVIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MVIP53647.2022.9738749","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Lip-reading is the task of recognizing speech from lip movements. It is difficult because the lip movements produced when pronouncing different words are often very similar. A viseme is the visual unit used to describe lip movements during speech. This paper shows how to exploit external text data (for viseme-to-character mapping) by splitting video-to-character recognition into two stages, converting video to visemes and then visemes to characters, each handled by a separate model. Our proposed method reduces word error rate by an absolute 4% compared to a typical sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading dataset (LRS2).
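The key benefit of the two-stage split is that the second stage (viseme-to-character) needs no video at all: any external text corpus can be converted into (viseme, character) training pairs. The sketch below illustrates this idea under a simplified assumption; the grouping `CHAR_TO_VISEME`, the function `text_to_visemes`, and the toy corpus are illustrative inventions, not the paper's actual viseme inventory (real systems typically map text to phonemes and then to visemes).

```python
# Minimal sketch: derive (viseme, character) training pairs from external text.
# The many-to-one grouping below is hypothetical and only meant to show why
# distinct words can collapse to the same viseme sequence.

CHAR_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",                     # bilabial closure
    "f": "V2", "v": "V2",                                 # labiodental
    "t": "V3", "d": "V3", "s": "V3", "z": "V3", "n": "V3", "l": "V3",
    "k": "V4", "g": "V4",
    "a": "V5", "e": "V5", "i": "V5",
    "o": "V6", "u": "V6",
    " ": " ",
}

def text_to_visemes(text: str) -> str:
    """Map a character string to its viseme string (unknown chars kept as-is)."""
    return "".join(CHAR_TO_VISEME.get(c, c) for c in text.lower())

if __name__ == "__main__":
    # External text corpus: no video is needed to build these pairs.
    corpus = ["bat", "mat", "pat"]
    for word in corpus:
        print(word, "->", text_to_visemes(word))
    # All three words map to the same viseme sequence ("V1V5V3"), which is
    # exactly the ambiguity the viseme-to-character decoder must resolve
    # from linguistic context learned on external text.
```

In a full pipeline, the first-stage model would predict such viseme sequences from video, and the second-stage model, trained on pairs like those above, would decode them into characters.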