{"title":"自动视觉唇读:机器学习方法的比较回顾","authors":"Khosro Rezaee , Maryam Yeganeh","doi":"10.1016/j.rineng.2025.107171","DOIUrl":null,"url":null,"abstract":"<div><div>Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges like pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.</div></div>","PeriodicalId":36919,"journal":{"name":"Results in Engineering","volume":"28 ","pages":"Article 107171"},"PeriodicalIF":7.9000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic visual lip reading: A comparative review of machine-learning approaches\",\"authors\":\"Khosro Rezaee , Maryam Yeganeh\",\"doi\":\"10.1016/j.rineng.2025.107171\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. 
We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges like pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.</div></div>\",\"PeriodicalId\":36919,\"journal\":{\"name\":\"Results in Engineering\",\"volume\":\"28 \",\"pages\":\"Article 107171\"},\"PeriodicalIF\":7.9000,\"publicationDate\":\"2025-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Results in Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2590123025032268\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Results in Engineering","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2590123025032268","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, MULTIDISCIPLINARY","Score":null,"Total":0}
Automatic visual lip reading: A comparative review of machine-learning approaches
Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges such as pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.
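To make the architectural pattern at the center of the review concrete, the following is a minimal sketch, assuming PyTorch, of the hybrid design the abstract describes: a three-dimensional convolutional front-end that captures short-range lip motion, feeding a Transformer encoder that models long-range temporal context over the utterance. The class name, layer sizes, and input shape are illustrative assumptions, not the specific configurations benchmarked in the review.

import torch
import torch.nn as nn

class HybridLipReader(nn.Module):
    """Illustrative 3D-conv + Transformer backbone for visual-only lip reading."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        # 3D convolution over (time, height, width) captures short-range
        # spatiotemporal motion around the mouth region.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        self.proj = nn.Linear(64, d_model)  # per-frame embedding after spatial pooling
        # Transformer encoder models long-range temporal dependencies.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, vocab_size)  # e.g. a CTC output head

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, frames, height, width), grayscale mouth crops
        x = self.frontend(video)                # (B, 64, T, H', W')
        x = x.mean(dim=(3, 4)).transpose(1, 2)  # global spatial pool -> (B, T, 64)
        x = self.encoder(self.proj(x))          # (B, T, d_model)
        return self.classifier(x)               # per-frame logits over the vocabulary

model = HybridLipReader(vocab_size=40)
clips = torch.randn(2, 1, 32, 88, 88)  # two clips of 32 grayscale 88x88 frames
print(model(clips).shape)               # torch.Size([2, 32, 40])

The audio-as-teacher strategy the abstract credits with improving visual-only reliability is commonly realized as cross-modal knowledge distillation: during training, a pretrained audio encoder supplies target representations, and a feature-matching penalty such as L_total = L_task + lambda * ||z_visual - z_audio||^2 is added to the objective; the audio branch is discarded at inference, so deployment remains visual-only. This formulation is a common instance of the strategy, not necessarily the one used by the works surveyed.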