{"title":"From pixels to letters: A high-accuracy CPU-real-time American Sign Language detection pipeline","authors":"Jonas Rheiner , Daniel Kerger , Matthias Drüppel","doi":"10.1016/j.mlwa.2025.100650","DOIUrl":null,"url":null,"abstract":"<div><div>We introduce a CPU-real-time American Sign Language (ASL) recognition system designed to bridge communication barriers between the deaf community and the broader public. Our multi-step pipeline includes preprocessing, a hand detection stage, and a classification model using a MobileNetV3 convolutional neural network backbone followed by a classification head. We train and evaluate our model using a combined dataset of 252k labeled images from two distinct ASL datasets. This increases generalization on unseen data and strengthens our evaluation. We employ a two-step training: The backbone is initialized through transfer learning and frozen for the initial training of the head. A second training phase with lower learning rate and unfrozen weights yields an exceptional test accuracy of 99.98% and <span><math><mo>></mo></math></span>99.93% on the two datasets - setting new benchmarks for ASL detection. With an CPU-inference time under 500 ms, it ensures real-time performance on affordable hardware. We propose a straightforward method to determine the amount of data needed for validation and testing and to quantify the remaining statistical error. For this we calculate accuracy as a function of validation set size, and thus ensure sufficient data is allocated for evaluation. Model interpretability is enhanced using Gradient-weighted Class Activation Mapping (Grad-CAM), which provides visual explanations by highlighting key image regions influencing predictions. This transparency fosters trust and improves user understanding of the system’s decisions. Our system sets new benchmarks in ASL gesture recognition by closing the accuracy gap of state-of-the-art solutions, while offering broad applicability through CPU-real-time inference and interpretability of our predictions.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"20 ","pages":"Article 100650"},"PeriodicalIF":4.9000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025000337","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce a CPU-real-time American Sign Language (ASL) recognition system designed to bridge communication barriers between the deaf community and the broader public. Our multi-step pipeline includes preprocessing, a hand detection stage, and a classification model using a MobileNetV3 convolutional neural network backbone followed by a classification head. We train and evaluate our model on a combined dataset of 252k labeled images drawn from two distinct ASL datasets, which increases generalization on unseen data and strengthens our evaluation. We employ a two-step training procedure: the backbone is initialized through transfer learning and frozen during the initial training of the head; a second training phase with a lower learning rate and unfrozen weights then yields exceptional test accuracies of 99.98% and >99.93% on the two datasets, setting new benchmarks for ASL detection. With a CPU-inference time under 500 ms, the system delivers real-time performance on affordable hardware. We propose a straightforward method to determine the amount of data needed for validation and testing and to quantify the remaining statistical error: we calculate accuracy as a function of validation set size, ensuring that sufficient data is allocated for evaluation. Model interpretability is enhanced using Gradient-weighted Class Activation Mapping (Grad-CAM), which provides visual explanations by highlighting the image regions that most influence predictions. This transparency fosters trust and improves user understanding of the system's decisions. Our system sets new benchmarks in ASL gesture recognition by closing the accuracy gap left by state-of-the-art solutions, while offering broad applicability through CPU-real-time inference and interpretable predictions.
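
The abstract does not include implementation details, but the described classification stage maps naturally onto a few lines of PyTorch. Below is a minimal sketch: an ImageNet-pretrained MobileNetV3 backbone followed by a small classification head. The `mobilenet_v3_large` variant, the head dimensions, and `NUM_CLASSES` are assumptions for illustration; the paper specifies only a MobileNetV3 backbone plus a classification head.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 29  # hypothetical: ASL alphabet datasets often use 26 letters plus a few extra signs


class ASLClassifier(nn.Module):
    """MobileNetV3 backbone followed by a small classification head."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        # Transfer learning: start from ImageNet-pretrained weights.
        backbone = models.mobilenet_v3_large(
            weights=models.MobileNet_V3_Large_Weights.DEFAULT
        )
        self.features = backbone.features  # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Replacement classification head (dimensions are illustrative).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(960, 512),  # 960 = channel count of the large variant's last conv block
            nn.Hardswish(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.pool(self.features(x)))
```

A single forward pass through a network of this size on a desktop CPU comfortably fits the sub-500 ms budget the authors report; this can be checked by timing `model(x)` with `time.perf_counter()`.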
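The two-step training schedule can likewise be sketched. Phase one trains only the head on top of the frozen, transfer-learned backbone; phase two unfreezes all weights and fine-tunes end to end at a lower learning rate. The optimizer choice, learning rates, and epoch counts below are placeholders, not the paper's settings, and `train_loader` is an assumed `DataLoader` over the combined labeled dataset.

```python
import torch

model = ASLClassifier()  # from the sketch above
criterion = torch.nn.CrossEntropyLoss()


def run_epochs(optimizer, loader, epochs):
    """One training phase: a standard supervised loop over the labeled images."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()


# Phase 1: freeze the pretrained backbone and train only the head.
for p in model.features.parameters():
    p.requires_grad = False
run_epochs(torch.optim.Adam(model.head.parameters(), lr=1e-3), train_loader, epochs=5)

# Phase 2: unfreeze everything and fine-tune at a lower learning rate.
for p in model.features.parameters():
    p.requires_grad = True
run_epochs(torch.optim.Adam(model.parameters(), lr=1e-5), train_loader, epochs=5)
```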
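The proposed check on evaluation-set size can be reproduced with a short subsampling experiment: compute accuracy on random subsets of the held-out data at growing sizes and track the binomial standard error sqrt(p(1-p)/n) of each estimate. The function below is a sketch under the assumption that per-example correctness on the full held-out set is already available as a boolean array.

```python
import numpy as np


def accuracy_vs_size(correct: np.ndarray, sizes, seed: int = 0):
    """Accuracy and its binomial standard error at growing evaluation-set sizes.

    `correct` holds per-example correctness (True/False) on the full held-out set.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for n in sizes:
        sample = rng.choice(correct, size=n, replace=False)
        p = sample.mean()
        stderr = np.sqrt(p * (1 - p) / n)  # binomial standard error of the estimate
        rows.append((n, p, stderr))
    return rows
```

The evaluation split is large enough once the standard error drops below the tolerance one cares about; at 99.9% accuracy and n = 25,000, for example, the standard error is roughly 0.02 percentage points.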
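Finally, the Grad-CAM visualizations can be generated with a compact hook-based implementation. This sketch follows the standard Grad-CAM recipe (gradients of the target-class logit, global-average-pooled into channel weights); the choice of `model.features[-1]` as the target layer is an assumption, deep convolutional layers being the usual pick.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_class, layer):
    """Minimal Grad-CAM: weight the chosen layer's activations by pooled gradients."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        model.eval()
        logits = model(image.unsqueeze(0))  # image: (3, H, W)
        model.zero_grad()
        logits[0, target_class].backward()
        # Global-average-pool the gradients into per-channel weights.
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)
        # Weighted sum over channels, keeping only positive influence.
        cam = F.relu((weights * feats["a"]).sum(dim=1))
        cam = F.interpolate(
            cam.unsqueeze(1), size=image.shape[1:], mode="bilinear", align_corners=False
        )[0, 0]
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove()
        h2.remove()
```

Usage would look like `heatmap = grad_cam(model, image, target_class=pred, layer=model.features[-1])`, with the normalized heatmap overlaid on the input frame to show which hand regions drove the prediction.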