Deep learning based 3D residual convolutional and Multi-Head Attention (3D-RMA) for lip-reading

Archana Chaudhari, Masuk Abdullah, Vivek Deshpande, Tushar Zanke, Samrudhi Wath, Snehashish Mulgir, Stuti Jagtap

Results in Control and Optimization, Volume 20, Article 100608 (September 2025). DOI: 10.1016/j.rico.2025.100608
Lip reading, an essential yet intricate facet of communication, has seen notable progress through the application of advanced deep learning techniques. This research introduces a deep learning-based lip-reading model that integrates Conv3D layers, Multi-Head Attention mechanisms, Bidirectional LSTMs, and a Dense output layer, trained with a custom Connectionist Temporal Classification (CTC) loss function. A comprehensive data preprocessing pipeline extracts video frames, normalizes pixel values, and converts textual alignments into numerical tokens, so that video and transcript pairs can be fed directly to the model. The architecture is structured to capture spatiotemporal features: Conv3D layers handle spatial information, while Multi-Head Attention mechanisms and Bidirectional LSTMs model temporal dependencies. Residual connections and Max-Pooling layers enhance feature extraction and abstraction, and Layer Normalization and Dropout layers stabilize learning and mitigate overfitting. Through extensive training and evaluation, the model achieves 96% accuracy in decoding lip movements and predicting the corresponding words. The CTC loss function allows variable-length sequences to be handled effectively, further improving performance. This research provides a technically sound approach to lip reading, advancing visual speech recognition and offering potential benefits for communication accessibility among individuals with hearing impairments.
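As a concrete illustration of the preprocessing pipeline the abstract describes, the sketch below reads video frames with OpenCV, standardizes pixel values, and maps alignment transcripts to integer character tokens. The mouth-crop coordinates, vocabulary, and alignment file format are illustrative assumptions modeled on common GRID-corpus pipelines, not details taken from the paper.

```python
# Hypothetical preprocessing sketch; crop region, vocabulary, and
# alignment format are assumptions, not the paper's actual pipeline.
import cv2
import numpy as np
import tensorflow as tf

vocab = list("abcdefghijklmnopqrstuvwxyz'?!123456789 ")
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")

def load_video(path: str) -> np.ndarray:
    """Read a video, convert frames to grayscale, crop the mouth region,
    and standardize pixel values per video."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(gray[190:236, 80:220])  # assumed fixed mouth crop
    cap.release()
    video = np.stack(frames).astype("float32")
    return (video - video.mean()) / (video.std() + 1e-8)

def load_alignments(path: str) -> tf.Tensor:
    """Convert a word-level alignment file ("start end word" per line)
    into a sequence of integer character tokens."""
    tokens = []
    with open(path) as f:
        for line in f:
            _, _, word = line.split()
            if word != "sil":  # skip silence markers
                tokens.extend(" " + word)
    return char_to_num(tf.constant(tokens[1:]))  # drop the leading space
```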
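The following is a minimal Keras sketch of the kind of architecture the abstract outlines: Conv3D blocks with a residual connection and Max-Pooling, Multi-Head Attention with Layer Normalization, Bidirectional LSTMs with Dropout, and a Dense softmax output. Input shape, filter counts, projection width, head count, and vocabulary size are all assumptions, not the paper's reported configuration.

```python
# A minimal sketch of a 3D-RMA-style model; all sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def build_lip_reading_model(frames=75, height=46, width=140,
                            channels=1, vocab_size=40):
    inputs = layers.Input(shape=(frames, height, width, channels))

    # Conv3D block with a residual connection; max-pooling only over the
    # spatial axes so the temporal axis survives for the sequence model.
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(inputs)
    shortcut = x
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.Add()([x, shortcut])
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

    x = layers.Conv3D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

    # Collapse spatial dimensions into one feature vector per frame,
    # then project down before attention.
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.Dense(256, activation="relu")(x)

    # Multi-Head Attention over the frame sequence, with a residual
    # connection and Layer Normalization.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))

    # Bidirectional LSTMs with Dropout for temporal dependencies.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Dropout(0.5)(x)

    # Dense softmax over the vocabulary plus one CTC blank token.
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_lip_reading_model()
```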
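A custom CTC loss of the kind the abstract mentions can be wrapped around Keras' built-in ctc_batch_cost; the snippet below is one plausible formulation, not the paper's exact implementation. It uses the padded label width as every sample's label length, which assumes fixed-length padded targets; a production pipeline would pass true per-sample lengths.

```python
# Hedged sketch of a custom CTC loss using Keras' ctc_batch_cost.
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    batch = tf.shape(y_true)[0]
    # Frames per sample and (padded) tokens per sample, one row each.
    input_len = tf.shape(y_pred)[1] * tf.ones((batch, 1), dtype=tf.int32)
    label_len = tf.shape(y_true)[1] * tf.ones((batch, 1), dtype=tf.int32)
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred,
                                           input_len, label_len)

# `model` refers to the architecture sketch above.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
```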