STDNet: Improved lip reading via short-term temporal dependency modeling

Q1 Computer Science

Virtual Reality Intelligent Hardware. Pub Date: 2025-04-01. DOI: 10.1016/j.vrih.2024.07.003

Xiaoer Wu, Zhenhua Tan, Ziwei Cheng, Yuran Ru

Journal Article. Virtual Reality Intelligent Hardware, volume 7, issue 2, pages 173-187. https://www.sciencedirect.com/science/article/pii/S209657962400038X

Citations: 0

Abstract


STDNet: Improved lip reading via short-term temporal dependency modeling

Background

Lip reading uses lip images for visual speech recognition. Deep-learning-based lip reading has greatly improved performance on current datasets; however, most existing research overlooks the short-term temporal dependencies of lip-shape variations between adjacent frames, which leaves room for further improvement in feature extraction.

Methods

This article presents a spatiotemporal feature fusion network (STDNet) that compensates for the deficiencies of current lip-reading approaches in short-term temporal dependency modeling. Specifically, to distinguish highly similar and intricate content, STDNet adds a temporal feature extraction branch based on a 3D-CNN, which enhances the learning of dynamic lip movements in adjacent frames without affecting spatial feature extraction. In particular, we designed a Local-Temporal Block, which aggregates interframe differences and strengthens the relationship between local lip regions through multiscale convolution. We incorporated the squeeze-and-excitation mechanism into the Global-Temporal Block, which processes a single frame as an independent unit to learn temporal variations across the entire lip region more effectively. Furthermore, attention pooling was introduced to highlight meaningful frames containing key semantic information for the target word.
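Two of the building blocks named above, interframe differences and squeeze-and-excitation gating, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give layer shapes, which axis is gated, or how the blocks are wired together. The function names, the flat-list data layout, and the weight matrices `w1` and `w2` are all assumptions made for illustration, and plain Python is used only to keep the example self-contained.

```python
import math


def frame_differences(frames):
    """Differences between adjacent frames.

    frames: list of T frames, each a flat list of values.
    Returns T-1 lists; entry t is frames[t+1] - frames[t], element-wise.
    """
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def se_gate(units, w1, w2):
    """Squeeze-and-excitation-style gating over a list of units.

    units: list of N units (e.g. channels or frames), each a flat list.
    w1: R x N reduction weights, w2: N x R expansion weights.
    Both weight matrices are hypothetical; a trained model learns them.
    """
    # Squeeze: one global average per unit.
    squeezed = [sum(u) / len(u) for u in units]
    # Excitation: bottleneck with ReLU, then a sigmoid gate per unit.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: reweight every value of a unit by that unit's gate.
    return [[g * v for v in u] for g, u in zip(gates, units)]


# Three toy frames with two values each.
frames = [[0.0, 1.0], [0.5, 1.0], [0.5, 3.0]]
diffs = frame_differences(frames)  # [[0.5, 0.0], [0.0, 2.0]]

# Gate the two difference frames with illustrative weights.
gated = se_gate(diffs, w1=[[1.0, 1.0]], w2=[[0.5], [-0.5]])
```

The difference step makes motion between adjacent frames explicit, while the gate lets the network emphasize or suppress whole units based on their global statistics. A practical model would apply both to feature tensors inside a deep-learning framework rather than to Python lists.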

Results

Experimental results demonstrated STDNet's superior performance on the LRW and LRW-1000 datasets, achieving word-level recognition accuracies of 90.2% and 53.56%, respectively. Extensive ablation experiments verified the rationality and effectiveness of its modules.

Conclusions

The proposed model effectively addresses short-term temporal dependency limitations in lip reading and improves its temporal robustness against variable-length sequences. These advancements validate the importance of explicit short-term dynamics modeling for practical lip-reading systems.
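One common way for attention pooling to cope with variable-length sequences is to normalize the attention weights over the valid frames only, so padded frames cannot influence the result. The sketch below shows that idea under stated assumptions: the abstract does not say how STDNet computes its attention scores or handles padding, so `attention_pool`, its `scores` input, and the `length` argument are hypothetical names chosen for illustration.

```python
import math


def attention_pool(frame_features, scores, length):
    """Softmax-weighted average over the first `length` frames.

    frame_features: list of T feature vectors (lists of floats), maybe padded.
    scores: list of T unnormalized relevance scores (hypothetical; a model
            would predict these from the features).
    length: number of valid frames, with 1 <= length <= T.
    Returns (pooled_vector, weights_over_valid_frames).
    """
    if not 1 <= length <= len(frame_features):
        raise ValueError("length must be between 1 and the number of frames")
    valid_scores = scores[:length]
    # Softmax over valid frames only; subtract the max for stability.
    top = max(valid_scores)
    exps = [math.exp(s - top) for s in valid_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    # zip stops at len(weights), so padded frames are never read.
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


# The third frame is padding; its large score and values are ignored.
features = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
pooled, weights = attention_pool(features, scores=[0.0, 0.0, 5.0], length=2)
# pooled == [2.0, 1.0], weights == [0.5, 0.5]
```

Because the weights always sum to one over the valid frames, the pooled vector has the same scale whether a clip has 10 frames or 40, which is one plausible reading of the robustness claim above.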


Source journal

Virtual Reality Intelligent Hardware (Computer Science: Computer Graphics and Computer-Aided Design)

CiteScore: 6.40

Self-citation rate: 0.00%

Review time: 12 weeks