TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading

IF 3.1 | Zone 4, Computer Science | Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Visual Communication and Image Representation | Pub Date: 2025-07-23 | DOI: 10.1016/j.jvcir.2025.104540

Byung Hoon Lee, Wooseok Shin, Sung Won Han

Volume 111, Article 104540. Publisher page: https://www.sciencedirect.com/science/article/pii/S1047320325001543

Citations: 0

Abstract

Word-level lipreading typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained because blind spots in the receptive field can lose information about the continuous nature of lip movements. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieves higher accuracy with fewer parameters and fewer floating-point operations than existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).
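The blind-spot argument in the abstract can be made concrete with a small receptive-field calculation. The sketch below is illustrative only and is not taken from the TD3Net repository: it assumes a kernel size of 3, a dense block in which every layer receives the block input and all earlier layer outputs, and a doubling dilation schedule. The function names (`dense_block_paths`, `has_blind_spots`) and the rule "dilation 2**group for the multi-dilated case" are assumptions made for this example. For each layer it lists which input frame offsets are reachable through each skip-connected feature group.

```python
def conv_offsets(dilation, kernel_size=3):
    """Frame offsets sampled by one dilated 1-D convolution tap pattern."""
    half = kernel_size // 2
    return {k * dilation for k in range(-half, half + 1)}


def expand(field, dilation, kernel_size=3):
    """Receptive field after a dilated conv is applied to a feature
    whose own receptive field (in input frames) is `field`."""
    return {f + o for f in field for o in conv_offsets(dilation, kernel_size)}


def dense_block_paths(num_layers, multi_dilated):
    """For each layer, return the sorted input offsets reached through
    each skip-connected group (group 0 is the block input).

    Assumed schedules (hypothetical, for illustration):
      - single dilation: layer l uses 2**(l-1) on every group
      - multi-dilated:   group g always uses 2**g
    """
    fields = [{0}]  # the block input sees only its own frame
    paths_per_layer = []
    for layer in range(1, num_layers + 1):
        paths = []
        for group, field in enumerate(fields):
            dilation = 2 ** group if multi_dilated else 2 ** (layer - 1)
            paths.append(sorted(expand(field, dilation)))
        paths_per_layer.append(paths)
        # The layer output combines every path, so its field is their union.
        fields.append(set().union(*paths))
    return paths_per_layer


def has_blind_spots(offsets):
    """True if the sorted offsets skip any frame between min and max."""
    return len(offsets) != offsets[-1] - offsets[0] + 1


if __name__ == "__main__":
    for name, flag in (("single dilation", False), ("multi-dilated", True)):
        print(name)
        for layer, paths in enumerate(dense_block_paths(3, flag), start=1):
            for group, offsets in enumerate(paths):
                gap = "blind spots" if has_blind_spots(offsets) else "dense"
                print(f"  layer {layer}, group {group}: {offsets} ({gap})")
```

Under these assumptions, the third layer of the single-dilation block reads the raw input only at offsets -4, 0 and 4, whereas the multi-dilated variant gives each group the dilation that matches its own receptive field, so every path stays contiguous while the widest path still spans 15 frames.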

Source journal

Journal of Visual Communication and Image Representation (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 5.40
Self-citation rate: 11.50%
Articles published: 188
Review time: 9.9 months

About the journal: The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.