Deep learning based 3D residual convolutional and Multi-Head Attention (3D-RMA) for lip-reading

Archana Chaudhari, Masuk Abdullah, Vivek Deshpande, Tushar Zanke, Samrudhi Wath, Snehashish Mulgir, Stuti Jagtap

Results in Control and Optimization, Volume 20, Article 100608 (September 2025). DOI: 10.1016/j.rico.2025.100608
Lip reading, an essential yet intricate facet of communication, has seen notable progress through the application of advanced deep learning techniques. This research introduces a deep learning-based lip-reading model that integrates Conv3D layers, Multi-Head Attention mechanisms, Bidirectional LSTMs, and a Dense output layer, trained with a custom Connectionist Temporal Classification (CTC) loss function. A comprehensive data preprocessing pipeline extracts video frames, normalizes pixel values, and converts textual alignments into numerical tokens, so that video and transcript pairs can be fed directly to the model. The architecture is structured to capture spatiotemporal features: Conv3D layers handle spatial information, while Multi-Head Attention mechanisms and Bidirectional LSTMs model temporal dependencies. Residual connections and Max-Pooling layers enhance feature extraction and abstraction, and Layer Normalization and Dropout layers stabilize learning and mitigate overfitting. Through extensive training and evaluation, the model achieves 96% accuracy in decoding lip movements and predicting the corresponding words. The CTC loss function allows variable-length sequences to be handled effectively, further improving performance. This research provides a technically sound approach to lip reading, advancing visual speech recognition and offering potential benefits for communication accessibility among individuals with hearing impairments.
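As a concrete illustration of the preprocessing pipeline the abstract describes, the sketch below reads video frames with OpenCV, standardizes pixel values, and maps alignment transcripts to integer character tokens. The mouth-crop coordinates, vocabulary, and alignment file format are illustrative assumptions modeled on common GRID-corpus pipelines, not details taken from the paper.

```python
# Hypothetical preprocessing sketch; crop region, vocabulary, and
# alignment format are assumptions, not the paper's actual pipeline.
import cv2
import numpy as np
import tensorflow as tf

vocab = list("abcdefghijklmnopqrstuvwxyz'?!123456789 ")
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")

def load_video(path: str) -> np.ndarray:
    """Read a video, convert frames to grayscale, crop the mouth region,
    and standardize pixel values per video."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(gray[190:236, 80:220])  # assumed fixed mouth crop
    cap.release()
    video = np.stack(frames).astype("float32")
    return (video - video.mean()) / (video.std() + 1e-8)

def load_alignments(path: str) -> tf.Tensor:
    """Convert a word-level alignment file ("start end word" per line)
    into a sequence of integer character tokens."""
    tokens = []
    with open(path) as f:
        for line in f:
            _, _, word = line.split()
            if word != "sil":  # skip silence markers
                tokens.extend(" " + word)
    return char_to_num(tf.constant(tokens[1:]))  # drop the leading space
```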
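The following is a minimal Keras sketch of the kind of architecture the abstract outlines: Conv3D blocks with a residual connection and Max-Pooling, Multi-Head Attention with Layer Normalization, Bidirectional LSTMs with Dropout, and a Dense softmax output. Input shape, filter counts, projection width, head count, and vocabulary size are all assumptions, not the paper's reported configuration.

```python
# A minimal sketch of a 3D-RMA-style model; all sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def build_lip_reading_model(frames=75, height=46, width=140,
                            channels=1, vocab_size=40):
    inputs = layers.Input(shape=(frames, height, width, channels))

    # Conv3D block with a residual connection; max-pooling only over the
    # spatial axes so the temporal axis survives for the sequence model.
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(inputs)
    shortcut = x
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.Add()([x, shortcut])
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

    x = layers.Conv3D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

    # Collapse spatial dimensions into one feature vector per frame,
    # then project down before attention.
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.Dense(256, activation="relu")(x)

    # Multi-Head Attention over the frame sequence, with a residual
    # connection and Layer Normalization.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))

    # Bidirectional LSTMs with Dropout for temporal dependencies.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Dropout(0.5)(x)

    # Dense softmax over the vocabulary plus one CTC blank token.
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_lip_reading_model()
```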
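A custom CTC loss of the kind the abstract mentions can be wrapped around Keras' built-in ctc_batch_cost; the snippet below is one plausible formulation, not the paper's exact implementation. It uses the padded label width as every sample's label length, which assumes fixed-length padded targets; a production pipeline would pass true per-sample lengths.

```python
# Hedged sketch of a custom CTC loss using Keras' ctc_batch_cost.
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    batch = tf.shape(y_true)[0]
    # Frames per sample and (padded) tokens per sample, one row each.
    input_len = tf.shape(y_pred)[1] * tf.ones((batch, 1), dtype=tf.int32)
    label_len = tf.shape(y_true)[1] * tf.ones((batch, 1), dtype=tf.int32)
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred,
                                           input_len, label_len)

# `model` refers to the architecture sketch above.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
```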