{"title":"Enhancing CTC-Based Visual Speech Recognition","authors":"Hendrik Laux, Anke Schmeink","doi":"arxiv-2409.07210","DOIUrl":null,"url":null,"abstract":"This paper presents LiteVSR2, an enhanced version of our previously\nintroduced efficient approach to Visual Speech Recognition (VSR). Building upon\nour knowledge distillation framework from a pre-trained Automatic Speech\nRecognition (ASR) model, we introduce two key improvements: a stabilized video\npreprocessing technique and feature normalization in the distillation process.\nThese improvements yield substantial performance gains on the LRS2 and LRS3\nbenchmarks, positioning LiteVSR2 as the current best CTC-based VSR model\nwithout increasing the volume of training data or computational resources\nutilized. Furthermore, we explore the scalability of our approach by examining\nperformance metrics across varying model complexities and training data\nvolumes. LiteVSR2 maintains the efficiency of its predecessor while\nsignificantly enhancing accuracy, thereby demonstrating the potential for\nresource-efficient advancements in VSR technology.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07210","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper presents LiteVSR2, an enhanced version of our previously
introduced efficient approach to Visual Speech Recognition (VSR). Building upon
our knowledge distillation framework from a pre-trained Automatic Speech
Recognition (ASR) model, we introduce two key improvements: a stabilized video
preprocessing technique and feature normalization in the distillation process.
These improvements yield substantial performance gains on the LRS2 and LRS3
benchmarks, establishing LiteVSR2 as the best-performing CTC-based VSR model to
date, without increasing the volume of training data or the computational
resources used. Furthermore, we explore the scalability of our approach by examining
performance metrics across varying model complexities and training data
volumes. LiteVSR2 maintains the efficiency of its predecessor while
significantly enhancing accuracy, thereby demonstrating the potential for
resource-efficient advancements in VSR technology.
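The abstract describes the approach only at a high level, so the following is a minimal, hypothetical PyTorch sketch of how a feature-normalized distillation objective with an auxiliary CTC head might be wired up: a small visual student is trained to match layer-normalized features of a frozen ASR teacher while a CTC head predicts the transcript. TinyVisualStudent, training_step, the L1 matching loss, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of feature-normalized distillation for CTC-based VSR.
# Module names, dimensions, LayerNorm and the L1 distillation loss are
# illustrative assumptions; this is not the LiteVSR2 reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVisualStudent(nn.Module):
    """Toy student encoder: mouth-crop video -> frame features + CTC logits."""

    def __init__(self, feat_dim: int = 256, vocab_size: int = 32):
        super().__init__()
        self.frontend = nn.Sequential(  # (B, 1, T, 88, 88) -> (B, 32, T, 1, 1)
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.proj = nn.Linear(32, feat_dim)              # per-frame feature projection
        self.ctc_head = nn.Linear(feat_dim, vocab_size)  # auxiliary CTC classifier

    def forward(self, video: torch.Tensor):
        x = self.frontend(video.transpose(1, 2))      # (B, 32, T, 1, 1)
        x = x.flatten(3).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        feats = self.proj(x)                          # (B, T, feat_dim)
        return feats, self.ctc_head(feats)            # features + CTC logits


def training_step(student, video, teacher_feats, targets, target_lens):
    """One hypothetical step: normalized feature matching + auxiliary CTC loss."""
    feats, logits = student(video)

    # Feature normalization before the distillation loss (one of the two
    # improvements named in the abstract; the exact scheme is assumed here).
    distill = F.l1_loss(F.layer_norm(feats, feats.shape[-1:]),
                        F.layer_norm(teacher_feats, teacher_feats.shape[-1:]))

    # CTC loss expects (T, B, V) log-probabilities.
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)
    input_lens = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    return distill + ctc


if __name__ == "__main__":
    student = TinyVisualStudent()
    video = torch.randn(2, 40, 1, 88, 88)    # B=2, T=40 grayscale mouth crops
    teacher_feats = torch.randn(2, 40, 256)  # stand-in for frozen ASR encoder outputs
    targets = torch.randint(1, 32, (2, 12))  # dummy token labels
    target_lens = torch.full((2,), 12, dtype=torch.long)
    print(training_step(student, video, teacher_feats, targets, target_lens))
```

In such a setup the teacher features would come from the pre-trained ASR encoder run on the time-aligned audio, with only the visual student and CTC head receiving gradients; the actual architecture and loss weighting used by LiteVSR2 are detailed in the paper itself.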