{"title":"From pixels to letters: A high-accuracy CPU-real-time American Sign Language detection pipeline","authors":"Jonas Rheiner , Daniel Kerger , Matthias Drüppel","doi":"10.1016/j.mlwa.2025.100650","DOIUrl":null,"url":null,"abstract":"<div><div>We introduce a CPU-real-time American Sign Language (ASL) recognition system designed to bridge communication barriers between the deaf community and the broader public. Our multi-step pipeline includes preprocessing, a hand detection stage, and a classification model using a MobileNetV3 convolutional neural network backbone followed by a classification head. We train and evaluate our model using a combined dataset of 252k labeled images from two distinct ASL datasets. This increases generalization on unseen data and strengthens our evaluation. We employ a two-step training: The backbone is initialized through transfer learning and frozen for the initial training of the head. A second training phase with lower learning rate and unfrozen weights yields an exceptional test accuracy of 99.98% and <span><math><mo>></mo></math></span>99.93% on the two datasets - setting new benchmarks for ASL detection. With an CPU-inference time under 500 ms, it ensures real-time performance on affordable hardware. We propose a straightforward method to determine the amount of data needed for validation and testing and to quantify the remaining statistical error. For this we calculate accuracy as a function of validation set size, and thus ensure sufficient data is allocated for evaluation. Model interpretability is enhanced using Gradient-weighted Class Activation Mapping (Grad-CAM), which provides visual explanations by highlighting key image regions influencing predictions. This transparency fosters trust and improves user understanding of the system’s decisions. Our system sets new benchmarks in ASL gesture recognition by closing the accuracy gap of state-of-the-art solutions, while offering broad applicability through CPU-real-time inference and interpretability of our predictions.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"20 ","pages":"Article 100650"},"PeriodicalIF":4.9000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025000337","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce a CPU-real-time American Sign Language (ASL) recognition system designed to bridge communication barriers between the deaf community and the broader public. Our multi-step pipeline includes preprocessing, a hand detection stage, and a classification model using a MobileNetV3 convolutional neural network backbone followed by a classification head. We train and evaluate our model on a combined dataset of 252k labeled images drawn from two distinct ASL datasets, which increases generalization on unseen data and strengthens our evaluation. We employ a two-step training procedure: the backbone is initialized through transfer learning and frozen during the initial training of the head; a second training phase with a lower learning rate and unfrozen weights then yields exceptional test accuracies of 99.98% and >99.93% on the two datasets, setting new benchmarks for ASL detection. With a CPU-inference time under 500 ms, the system delivers real-time performance on affordable hardware. We propose a straightforward method to determine the amount of data needed for validation and testing and to quantify the remaining statistical error: we calculate accuracy as a function of validation set size, ensuring that sufficient data is allocated for evaluation. Model interpretability is enhanced using Gradient-weighted Class Activation Mapping (Grad-CAM), which provides visual explanations by highlighting the image regions that most influence predictions. This transparency fosters trust and improves user understanding of the system's decisions. Our system sets new benchmarks in ASL gesture recognition by closing the accuracy gap left by state-of-the-art solutions, while offering broad applicability through CPU-real-time inference and interpretable predictions.
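
The abstract does not include implementation details, but the described classification stage maps naturally onto a few lines of PyTorch. Below is a minimal sketch: an ImageNet-pretrained MobileNetV3 backbone followed by a small classification head. The `mobilenet_v3_large` variant, the head dimensions, and `NUM_CLASSES` are assumptions for illustration; the paper specifies only a MobileNetV3 backbone plus a classification head.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 29  # hypothetical: ASL alphabet datasets often use 26 letters plus a few extra signs


class ASLClassifier(nn.Module):
    """MobileNetV3 backbone followed by a small classification head."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        # Transfer learning: start from ImageNet-pretrained weights.
        backbone = models.mobilenet_v3_large(
            weights=models.MobileNet_V3_Large_Weights.DEFAULT
        )
        self.features = backbone.features  # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Replacement classification head (dimensions are illustrative).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(960, 512),  # 960 = channel count of the large variant's last conv block
            nn.Hardswish(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.pool(self.features(x)))
```

A single forward pass through a network of this size on a desktop CPU comfortably fits the sub-500 ms budget the authors report; this can be checked by timing `model(x)` with `time.perf_counter()`.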
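The two-step training schedule can likewise be sketched. Phase one trains only the head on top of the frozen, transfer-learned backbone; phase two unfreezes all weights and fine-tunes end to end at a lower learning rate. The optimizer choice, learning rates, and epoch counts below are placeholders, not the paper's settings, and `train_loader` is an assumed `DataLoader` over the combined labeled dataset.

```python
import torch

model = ASLClassifier()  # from the sketch above
criterion = torch.nn.CrossEntropyLoss()


def run_epochs(optimizer, loader, epochs):
    """One training phase: a standard supervised loop over the labeled images."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()


# Phase 1: freeze the pretrained backbone and train only the head.
for p in model.features.parameters():
    p.requires_grad = False
run_epochs(torch.optim.Adam(model.head.parameters(), lr=1e-3), train_loader, epochs=5)

# Phase 2: unfreeze everything and fine-tune at a lower learning rate.
for p in model.features.parameters():
    p.requires_grad = True
run_epochs(torch.optim.Adam(model.parameters(), lr=1e-5), train_loader, epochs=5)
```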
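The proposed check on evaluation-set size can be reproduced with a short subsampling experiment: compute accuracy on random subsets of the held-out data at growing sizes and track the binomial standard error sqrt(p(1-p)/n) of each estimate. The function below is a sketch under the assumption that per-example correctness on the full held-out set is already available as a boolean array.

```python
import numpy as np


def accuracy_vs_size(correct: np.ndarray, sizes, seed: int = 0):
    """Accuracy and its binomial standard error at growing evaluation-set sizes.

    `correct` holds per-example correctness (True/False) on the full held-out set.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for n in sizes:
        sample = rng.choice(correct, size=n, replace=False)
        p = sample.mean()
        stderr = np.sqrt(p * (1 - p) / n)  # binomial standard error of the estimate
        rows.append((n, p, stderr))
    return rows
```

The evaluation split is large enough once the standard error drops below the tolerance one cares about; at 99.9% accuracy and n = 25,000, for example, the standard error is roughly 0.02 percentage points.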
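Finally, the Grad-CAM visualizations can be generated with a compact hook-based implementation. This sketch follows the standard Grad-CAM recipe (gradients of the target-class logit, global-average-pooled into channel weights); the choice of `model.features[-1]` as the target layer is an assumption, deep convolutional layers being the usual pick.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_class, layer):
    """Minimal Grad-CAM: weight the chosen layer's activations by pooled gradients."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        model.eval()
        logits = model(image.unsqueeze(0))  # image: (3, H, W)
        model.zero_grad()
        logits[0, target_class].backward()
        # Global-average-pool the gradients into per-channel weights.
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)
        # Weighted sum over channels, keeping only positive influence.
        cam = F.relu((weights * feats["a"]).sum(dim=1))
        cam = F.interpolate(
            cam.unsqueeze(1), size=image.shape[1:], mode="bilinear", align_corners=False
        )[0, 0]
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove()
        h2.remove()
```

Usage would look like `heatmap = grad_cam(model, image, target_class=pred, layer=model.features[-1])`, with the normalized heatmap overlaid on the input frame to show which hand regions drove the prediction.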