Video-Based Arabic Sign Language Recognition with Mediapipe and Deep Learning Techniques.

IF 2.7 | Q3 | Imaging Science & Photographic Technology
Dana El-Rushaidat, Nour Almohammad, Raine Yeh, Kinda Fayyad
{"title":"Video-Based Arabic Sign Language Recognition with Mediapipe and Deep Learning Techniques.","authors":"Dana El-Rushaidat, Nour Almohammad, Raine Yeh, Kinda Fayyad","doi":"10.3390/jimaging12040177","DOIUrl":null,"url":null,"abstract":"<p><p>This paper addresses the critical communication barrier experienced by deaf and hearing-impaired individuals in the Arab world through the development of an affordable, video-based Arabic Sign Language (ArSL) recognition system. Designed for broad accessibility, the system eliminates specialized hardware by leveraging standard mobile or laptop cameras. Our methodology employs Mediapipe for real-time extraction of hand, face, and pose landmarks from video streams. These anatomical features are then processed by a hybrid deep learning model integrating Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN component captures spatial features, such as intricate hand shapes and body movements, within individual frames. Concurrently, BiLSTMs model long-term temporal dependencies and motion trajectories across consecutive frames. This integrated CNN-BiLSTM architecture is critical for generating a comprehensive spatiotemporal representation, enabling accurate differentiation of complex signs where meaning relies on both static gestures and dynamic transitions, thus preventing misclassification that CNN-only or RNN-only models would incur. Rigorously evaluated on the author-created JUST-SL dataset and the publicly available KArSL dataset, the system achieved 96% overall accuracy for JUST-SL and an impressive 99% for KArSL. These results demonstrate the system's superior accuracy compared to previous research, particularly for recognizing full Arabic words, thereby significantly enhancing communication accessibility for the deaf and hearing-impaired community.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 4","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2026-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13117685/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/jimaging12040177","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

This paper addresses the critical communication barrier experienced by deaf and hearing-impaired individuals in the Arab world through the development of an affordable, video-based Arabic Sign Language (ArSL) recognition system. Designed for broad accessibility, the system eliminates specialized hardware by leveraging standard mobile or laptop cameras. Our methodology employs Mediapipe for real-time extraction of hand, face, and pose landmarks from video streams. These anatomical features are then processed by a hybrid deep learning model integrating Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN component captures spatial features, such as intricate hand shapes and body movements, within individual frames. Concurrently, BiLSTMs model long-term temporal dependencies and motion trajectories across consecutive frames. This integrated CNN-BiLSTM architecture is critical for generating a comprehensive spatiotemporal representation, enabling accurate differentiation of complex signs where meaning relies on both static gestures and dynamic transitions, thus preventing misclassification that CNN-only or RNN-only models would incur. Rigorously evaluated on the author-created JUST-SL dataset and the publicly available KArSL dataset, the system achieved 96% overall accuracy for JUST-SL and an impressive 99% for KArSL. These results demonstrate the system's superior accuracy compared to previous research, particularly for recognizing full Arabic words, thereby significantly enhancing communication accessibility for the deaf and hearing-impaired community.
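The front of this pipeline is the landmark extraction step. A minimal sketch of what that looks like with MediaPipe's Holistic solution is given below; the zero-padding for frames where a hand or face is not detected, and the function names, are illustrative assumptions rather than details taken from the paper. The landmark counts (33 pose, 468 face, 21 per hand) are MediaPipe's defaults.

```python
# Sketch: per-frame landmark extraction with MediaPipe Holistic.
# Assumptions (not from the paper): zero-padding for missing detections,
# function names, and detection/tracking confidence thresholds.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def landmarks_to_vector(results):
    """Flatten pose, face, and hand landmarks into one fixed-length vector."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[p.x, p.y, p.z]
                      for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    left = (np.array([[p.x, p.y, p.z]
                      for p in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[p.x, p.y, p.z]
                       for p in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, left, right])  # 1662 values per frame

def video_to_sequence(path):
    """Run Holistic over a sign video and return a (num_frames, 1662) array."""
    cap = cv2.VideoCapture(path)
    vectors = []
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            vectors.append(landmarks_to_vector(results))
    cap.release()
    return np.stack(vectors)
```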
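The landmark sequences then feed the hybrid classifier. The sketch below shows one plausible CNN-BiLSTM composition in Keras: Conv1D layers learn spatial patterns within each frame's landmark vector, and stacked bidirectional LSTMs model the motion across frames. The layer widths, sequence length, class count, and optimizer are assumptions for illustration; the abstract specifies only the CNN + BiLSTM design, not these exact hyperparameters.

```python
# Sketch: a CNN-BiLSTM classifier over landmark sequences.
# Assumptions (not from the paper): seq_len=30, layer widths, num_classes,
# dropout rate, and the Adam/sparse-crossentropy training setup.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(seq_len=30, feat_dim=1662, num_classes=100):
    model = models.Sequential([
        layers.Input(shape=(seq_len, feat_dim)),
        # Convolutions capture spatial structure within each frame's features
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        # BiLSTMs model temporal dependencies in both directions
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With this split of roles, a CNN-only model would discard the order of frames and an RNN-only model would have to learn spatial hand configurations from raw coordinates, which is why the abstract argues the combination is needed for signs whose meaning depends on both pose and transition.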

Source Journal: Journal of Imaging
Category: Medicine - Radiology, Nuclear Medicine and Imaging
CiteScore: 5.90
Self-citation rate: 6.20%
Articles published: 303
Review time: 7 weeks