Indonesian Lip-Reading Detection and Recognition Based on Lip Shape Using Face Mesh and Long-Term Recurrent Convolutional Network

Impact factor: 2.4 · JCR Q3 · Computer Science, Artificial Intelligence

Applied Computational Intelligence and Soft Computing · Pub Date: 2024-04-18 · DOI: 10.1155/2024/6479124

Aripin, Abas Setiawan


Citations: 0

Abstract

Environmental noise can hinder spoken communication, which motivates alternatives such as lip reading that do not depend on audio. Accurate interpretation of lip movements, however, is complicated by differences in individual lip shapes and therefore requires detailed analysis. Existing lip-reading datasets are predominantly in English, so this study also develops an Indonesian dataset to support more inclusive research. The study proposes an enhanced lip-reading system trained with a long-term recurrent convolutional network (LRCN) that takes eight different types of lip shapes into account. MediaPipe Face Mesh detects the lip landmarks, enabling the LRCN model to recognize Indonesian utterances. In the experiments, the LRCN model with three convolutional layers (LRCN-3Conv) achieved 95.42% accuracy on word test data and 95.63% on phrase test data, outperforming the convolutional long short-term memory (Conv-LSTM) method. On the original MIRACL-VC1 dataset, LRCN-3Conv reached a best accuracy of 90.67% in the word-labeled class, which compares favorably with previous studies. The authors attribute this success to MediaPipe Face Mesh, which enables accurate detection of the lip region. By combining deep learning with precise landmark detection, these findings promise improved communication accessibility for individuals facing auditory challenges.
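The abstract does not say how the lip region is derived from the Face Mesh landmarks, so the following is only a minimal sketch of one plausible preprocessing step, not the authors' method. It assumes landmarks arrive as normalized (x, y) pairs, as Face Mesh reports them, and it uses a small illustrative subset of lip indices (`LIP_INDICES`). The function name, the margin parameter, and the index subset are all hypothetical. MediaPipe itself is not called here.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]        # normalized (x, y), each in [0, 1]
Box = Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels

# Assumption: an illustrative subset of Face Mesh lip landmarks
# (mouth corners 61 and 291, outer lip 0 and 17, inner lip 13 and 14).
# The paper may use a different or larger set.
LIP_INDICES: Tuple[int, ...] = (61, 291, 0, 17, 13, 14)


def lip_bounding_box(
    landmarks: Sequence[Point],
    width: int,
    height: int,
    indices: Sequence[int] = LIP_INDICES,
    margin: float = 0.25,
) -> Box:
    """Return a pixel bounding box around the selected lip landmarks.

    `margin` pads the box by a fraction of its width and height; the
    result is clipped to the frame. `right` and `bottom` are exclusive,
    so the box can be used directly for slicing an image array.
    """
    xs = [landmarks[i][0] * width for i in indices]
    ys = [landmarks[i][1] * height for i in indices]
    pad_x = (max(xs) - min(xs)) * margin
    pad_y = (max(ys) - min(ys)) * margin
    left = max(0, int(min(xs) - pad_x))
    top = max(0, int(min(ys) - pad_y))
    right = min(width, int(max(xs) + pad_x) + 1)
    bottom = min(height, int(max(ys) + pad_y) + 1)
    return left, top, right, bottom
```

Applying such a crop to every frame of a clip would give a fixed-region image sequence, which is the kind of input a convolutional-recurrent model such as an LRCN consumes. The crop size and sequence length the paper uses are not given in the abstract.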


Source journal

Applied Computational Intelligence and Soft Computing (Computer Science, Artificial Intelligence)

CiteScore: 6.10

Self-citation rate: 3.40%

Review time: 21 weeks

About the journal: Applied Computational Intelligence and Soft Computing focuses on the disciplines of computer science, engineering, and mathematics. Its scope includes applications across the natural and social sciences that employ computational intelligence and soft computing technologies. Although computational intelligence and soft computing are established fields, their new applications are still developing and can be regarded as an emerging field, which is the focus of this journal.