{"title":"自动视觉唇读:机器学习方法的比较回顾","authors":"Khosro Rezaee , Maryam Yeganeh","doi":"10.1016/j.rineng.2025.107171","DOIUrl":null,"url":null,"abstract":"<div><div>Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges like pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.</div></div>","PeriodicalId":36919,"journal":{"name":"Results in Engineering","volume":"28 ","pages":"Article 107171"},"PeriodicalIF":7.9000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic visual lip reading: A comparative review of machine-learning approaches\",\"authors\":\"Khosro Rezaee , Maryam Yeganeh\",\"doi\":\"10.1016/j.rineng.2025.107171\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. 
We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges like pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.</div></div>\",\"PeriodicalId\":36919,\"journal\":{\"name\":\"Results in Engineering\",\"volume\":\"28 \",\"pages\":\"Article 107171\"},\"PeriodicalIF\":7.9000,\"publicationDate\":\"2025-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Results in Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2590123025032268\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Results in Engineering","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2590123025032268","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, MULTIDISCIPLINARY","Score":null,"Total":0}
Automatic visual lip reading: A comparative review of machine-learning approaches
Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges such as pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.
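To make the architectural pattern at the center of the review concrete, the following is a minimal sketch, assuming PyTorch, of the hybrid design the abstract describes: a three-dimensional convolutional front-end that captures short-range lip motion, feeding a Transformer encoder that models long-range temporal context over the utterance. The class name, layer sizes, and input shape are illustrative assumptions, not the specific configurations benchmarked in the review.

import torch
import torch.nn as nn

class HybridLipReader(nn.Module):
    """Illustrative 3D-conv + Transformer backbone for visual-only lip reading."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        # 3D convolution over (time, height, width) captures short-range
        # spatiotemporal motion around the mouth region.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        self.proj = nn.Linear(64, d_model)  # per-frame embedding after spatial pooling
        # Transformer encoder models long-range temporal dependencies.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, vocab_size)  # e.g. a CTC output head

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, frames, height, width), grayscale mouth crops
        x = self.frontend(video)                # (B, 64, T, H', W')
        x = x.mean(dim=(3, 4)).transpose(1, 2)  # global spatial pool -> (B, T, 64)
        x = self.encoder(self.proj(x))          # (B, T, d_model)
        return self.classifier(x)               # per-frame logits over the vocabulary

model = HybridLipReader(vocab_size=40)
clips = torch.randn(2, 1, 32, 88, 88)  # two clips of 32 grayscale 88x88 frames
print(model(clips).shape)               # torch.Size([2, 32, 40])

The audio-as-teacher strategy the abstract credits with improving visual-only reliability is commonly realized as cross-modal knowledge distillation: during training, a pretrained audio encoder supplies target representations, and a feature-matching penalty such as L_total = L_task + lambda * ||z_visual - z_audio||^2 is added to the objective; the audio branch is discarded at inference, so deployment remains visual-only. This formulation is a common instance of the strategy, not necessarily the one used by the works surveyed.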