Automatic visual lip reading: A comparative review of machine-learning approaches

Impact factor: 7.9; JCR quartile: Q1 (Engineering, Multidisciplinary)
Khosro Rezaee, Maryam Yeganeh
Journal: Results in Engineering, volume 28, Article 107171
Publication type: Journal Article
Published: 2025-09-08
DOI: 10.1016/j.rineng.2025.107171
URL: https://www.sciencedirect.com/science/article/pii/S2590123025032268
Citations: 0

Abstract

Automatic lip-reading systems are evolving from traditional handcrafted pipelines to advanced deep and hybrid architectures that integrate local motion modeling with long-range temporal context. This review provides a comprehensive synthesis of classical techniques and state-of-the-art learning approaches, with a specific focus on hybrid three-dimensional convolution plus Transformer or Conformer backbones and on multimodal training strategies that enable visual-only inference. Unlike previous surveys, we critically appraise datasets through the lenses of diversity, realism, robustness, and efficiency, and we foreground responsible deployment by addressing privacy, fairness, and transparency. We propose a clear taxonomy that spans classical, hybrid, and Transformer-based models. We compare their strengths and limitations for both word- and sentence-level recognition, and analyze the trade-offs between accuracy, computational cost, latency, and interpretability. The evidence indicates that lightweight hybrid models offer high accuracy with practical efficiency and that audio-as-teacher training significantly improves visual reliability when audio is unavailable. However, progress remains constrained by limited demographic and linguistic coverage, a reliance on studio-style capture, and uneven robustness to real-world challenges like pose, illumination, motion blur, and occlusion. The review concludes with a focused call to action: we must build multilingual and demographically balanced corpora with standardized robustness testing; develop parameter-efficient hybrid backbones suitable for edge deployment; adopt self-supervised and semi-supervised learning to reduce annotation demands; and report calibrated uncertainty, fairness diagnostics, and transparent documentation. These recommendations are intended to guide researchers toward creating scalable, reliable, and trustworthy lip-reading systems for real-world applications.
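The abstract's "audio-as-teacher" idea can be illustrated with a generic knowledge-distillation loss. The sketch below is an assumption, not the formulation used in the review: it uses the standard temperature-scaled distillation objective, and the names `distillation_loss`, `student_logits`, `teacher_logits`, `alpha`, and `temperature` are hypothetical. The visual-only student is trained to match both the ground-truth label and the softened output of a teacher that had access to audio.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a hard-label and a soft-label term (illustrative only).

    student_logits: outputs of the visual-only model (hypothetical)
    teacher_logits: outputs of an audio or audio-visual teacher (hypothetical)
    label: index of the ground-truth class
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)
    # Soft-label term: KL from the softened teacher to the softened student,
    # scaled by T^2 as is conventional in temperature-based distillation.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```

In this setup the teacher is needed only during training; at inference the student runs on video frames alone, which is what makes visual-only deployment possible when audio is unavailable.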
Source journal: Results in Engineering (Engineering, all)
CiteScore: 5.80
Self-citation rate: 34.00%
Articles published: 441
Review time: 47 days