Video-Based Arabic Sign Language Recognition with Mediapipe and Deep Learning Techniques.

IF 2.7 | Q3 | Imaging Science & Photographic Technology
Dana El-Rushaidat, Nour Almohammad, Raine Yeh, Kinda Fayyad
{"title":"Video-Based Arabic Sign Language Recognition with Mediapipe and Deep Learning Techniques.","authors":"Dana El-Rushaidat, Nour Almohammad, Raine Yeh, Kinda Fayyad","doi":"10.3390/jimaging12040177","DOIUrl":null,"url":null,"abstract":"<p><p>This paper addresses the critical communication barrier experienced by deaf and hearing-impaired individuals in the Arab world through the development of an affordable, video-based Arabic Sign Language (ArSL) recognition system. Designed for broad accessibility, the system eliminates specialized hardware by leveraging standard mobile or laptop cameras. Our methodology employs Mediapipe for real-time extraction of hand, face, and pose landmarks from video streams. These anatomical features are then processed by a hybrid deep learning model integrating Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN component captures spatial features, such as intricate hand shapes and body movements, within individual frames. Concurrently, BiLSTMs model long-term temporal dependencies and motion trajectories across consecutive frames. This integrated CNN-BiLSTM architecture is critical for generating a comprehensive spatiotemporal representation, enabling accurate differentiation of complex signs where meaning relies on both static gestures and dynamic transitions, thus preventing misclassification that CNN-only or RNN-only models would incur. Rigorously evaluated on the author-created JUST-SL dataset and the publicly available KArSL dataset, the system achieved 96% overall accuracy for JUST-SL and an impressive 99% for KArSL. These results demonstrate the system's superior accuracy compared to previous research, particularly for recognizing full Arabic words, thereby significantly enhancing communication accessibility for the deaf and hearing-impaired community.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 4","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2026-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13117685/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/jimaging12040177","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

This paper addresses the critical communication barrier experienced by deaf and hearing-impaired individuals in the Arab world through the development of an affordable, video-based Arabic Sign Language (ArSL) recognition system. Designed for broad accessibility, the system eliminates specialized hardware by leveraging standard mobile or laptop cameras. Our methodology employs Mediapipe for real-time extraction of hand, face, and pose landmarks from video streams. These anatomical features are then processed by a hybrid deep learning model integrating Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN component captures spatial features, such as intricate hand shapes and body movements, within individual frames. Concurrently, BiLSTMs model long-term temporal dependencies and motion trajectories across consecutive frames. This integrated CNN-BiLSTM architecture is critical for generating a comprehensive spatiotemporal representation, enabling accurate differentiation of complex signs where meaning relies on both static gestures and dynamic transitions, thus preventing misclassification that CNN-only or RNN-only models would incur. Rigorously evaluated on the author-created JUST-SL dataset and the publicly available KArSL dataset, the system achieved 96% overall accuracy for JUST-SL and an impressive 99% for KArSL. These results demonstrate the system's superior accuracy compared to previous research, particularly for recognizing full Arabic words, thereby significantly enhancing communication accessibility for the deaf and hearing-impaired community.
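The front of this pipeline is the landmark extraction step. A minimal sketch of what that looks like with MediaPipe's Holistic solution is given below; the zero-padding for frames where a hand or face is not detected, and the function names, are illustrative assumptions rather than details taken from the paper. The landmark counts (33 pose, 468 face, 21 per hand) are MediaPipe's defaults.

```python
# Sketch: per-frame landmark extraction with MediaPipe Holistic.
# Assumptions (not from the paper): zero-padding for missing detections,
# function names, and detection/tracking confidence thresholds.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def landmarks_to_vector(results):
    """Flatten pose, face, and hand landmarks into one fixed-length vector."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[p.x, p.y, p.z]
                      for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    left = (np.array([[p.x, p.y, p.z]
                      for p in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[p.x, p.y, p.z]
                       for p in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, left, right])  # 1662 values per frame

def video_to_sequence(path):
    """Run Holistic over a sign video and return a (num_frames, 1662) array."""
    cap = cv2.VideoCapture(path)
    vectors = []
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            vectors.append(landmarks_to_vector(results))
    cap.release()
    return np.stack(vectors)
```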
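The landmark sequences then feed the hybrid classifier. The sketch below shows one plausible CNN-BiLSTM composition in Keras: Conv1D layers learn spatial patterns within each frame's landmark vector, and stacked bidirectional LSTMs model the motion across frames. The layer widths, sequence length, class count, and optimizer are assumptions for illustration; the abstract specifies only the CNN + BiLSTM design, not these exact hyperparameters.

```python
# Sketch: a CNN-BiLSTM classifier over landmark sequences.
# Assumptions (not from the paper): seq_len=30, layer widths, num_classes,
# dropout rate, and the Adam/sparse-crossentropy training setup.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(seq_len=30, feat_dim=1662, num_classes=100):
    model = models.Sequential([
        layers.Input(shape=(seq_len, feat_dim)),
        # Convolutions capture spatial structure within each frame's features
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        # BiLSTMs model temporal dependencies in both directions
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With this split of roles, a CNN-only model would discard the order of frames and an RNN-only model would have to learn spatial hand configurations from raw coordinates, which is why the abstract argues the combination is needed for signs whose meaning depends on both pose and transition.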

Source Journal: Journal of Imaging
Category: Medicine - Radiology, Nuclear Medicine and Imaging
CiteScore: 5.90
Self-citation rate: 6.20%
Articles published: 303
Review time: 7 weeks