Recognizing American Sign Language gestures efficiently and accurately using a hybrid transformer model.

IF 3.9 2区综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES

Scientific Reports Pub Date : 2025-06-23 DOI:10.1038/s41598-025-06344-8

Mohammed Aly, Islam S Fathi

{"title":"Recognizing American Sign Language gestures efficiently and accurately using a hybrid transformer model.","authors":"Mohammed Aly, Islam S Fathi","doi":"10.1038/s41598-025-06344-8","DOIUrl":null,"url":null,"abstract":"<p><p>Gesture recognition plays a vital role in computer vision, especially for interpreting sign language and enabling human-computer interaction. Many existing methods struggle with challenges like heavy computational demands, difficulty in understanding long-range relationships, sensitivity to background noise, and poor performance in varied environments. While CNNs excel at capturing local details, they often miss the bigger picture. Vision Transformers, on the other hand, are better at modeling global context but usually require significantly more computational resources, limiting their use in real-time systems. To tackle these issues, we propose a Hybrid Transformer-CNN model that combines the strengths of both architectures. Our approach begins with CNN layers that extract detailed local features from both the overall hand and specific hand regions. These CNN features are then refined by a Vision Transformer module, which captures long-range dependencies and global contextual information within the gesture. This integration allows the model to effectively recognize subtle hand movements while maintaining computational efficiency. Tested on the ASL Alphabet dataset, our model achieves a high accuracy of 99.97%, runs at 110 frames per second, and requires only 5.0 GFLOPs-much less than traditional Vision Transformer models, which need over twice the computational power. Central to this success is our feature fusion strategy using element-wise multiplication, which helps the model focus on important gesture details while suppressing background noise. Additionally, we employ advanced data augmentation techniques and a training approach incorporating contrastive learning and domain adaptation to boost robustness. Overall, this work offers a practical and powerful solution for gesture recognition, striking an optimal balance between accuracy, speed, and efficiency-an important step toward real-world applications.</p>","PeriodicalId":21811,"journal":{"name":"Scientific Reports","volume":"15 1","pages":"20253"},"PeriodicalIF":3.9000,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12185765/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Reports","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41598-025-06344-8","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}

引用次数: 0

Abstract

Gesture recognition plays a vital role in computer vision, especially for interpreting sign language and enabling human-computer interaction. Many existing methods struggle with challenges like heavy computational demands, difficulty in understanding long-range relationships, sensitivity to background noise, and poor performance in varied environments. While CNNs excel at capturing local details, they often miss the bigger picture. Vision Transformers, on the other hand, are better at modeling global context but usually require significantly more computational resources, limiting their use in real-time systems. To tackle these issues, we propose a Hybrid Transformer-CNN model that combines the strengths of both architectures. Our approach begins with CNN layers that extract detailed local features from both the overall hand and specific hand regions. These CNN features are then refined by a Vision Transformer module, which captures long-range dependencies and global contextual information within the gesture. This integration allows the model to effectively recognize subtle hand movements while maintaining computational efficiency. Tested on the ASL Alphabet dataset, our model achieves a high accuracy of 99.97%, runs at 110 frames per second, and requires only 5.0 GFLOPs-much less than traditional Vision Transformer models, which need over twice the computational power. Central to this success is our feature fusion strategy using element-wise multiplication, which helps the model focus on important gesture details while suppressing background noise. Additionally, we employ advanced data augmentation techniques and a training approach incorporating contrastive learning and domain adaptation to boost robustness. Overall, this work offers a practical and powerful solution for gesture recognition, striking an optimal balance between accuracy, speed, and efficiency-an important step toward real-world applications.

Abstract Image

查看原文本刊更多论文

使用混合变压器模型高效准确地识别美国手语手势。

手势识别在计算机视觉中起着至关重要的作用，特别是在手语解释和人机交互方面。许多现有的方法都面临着一些挑战，比如计算量大、难以理解长期关系、对背景噪声敏感以及在不同环境下的性能差。虽然cnn擅长捕捉局部细节，但他们往往忽略了大局。另一方面，视觉变形器在建模全局上下文方面做得更好，但通常需要更多的计算资源，限制了它们在实时系统中的使用。为了解决这些问题，我们提出了一种混合变压器- cnn模型，它结合了两种架构的优势。我们的方法从CNN层开始，从整个手和特定的手区域提取详细的局部特征。这些CNN特征然后由视觉转换器模块进行细化，该模块捕获手势中的远程依赖关系和全局上下文信息。这种整合使模型在保持计算效率的同时有效地识别细微的手部动作。在ASL字母表数据集上测试，我们的模型达到了99.97%的高精度，运行速度为每秒110帧，只需要5.0 gflops，远远低于传统的视觉变压器模型，后者需要两倍以上的计算能力。这一成功的核心是我们使用元素智能乘法的特征融合策略，这有助于模型专注于重要的手势细节，同时抑制背景噪声。此外，我们采用先进的数据增强技术和结合对比学习和领域适应的训练方法来提高鲁棒性。总的来说，这项工作为手势识别提供了一个实用而强大的解决方案，在准确性、速度和效率之间取得了最佳平衡——这是迈向现实世界应用的重要一步。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Scientific Reports Natural Science Disciplines-

CiteScore

7.50

自引率

4.30%

发文量

19567

审稿时长

3.9 months

期刊介绍： We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor: 4.380 (2021), and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021). •Engineering Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world''s biggest challenges, helping to save lives and improve the way we live. •Physical sciences Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature — often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics. •Earth and environmental sciences Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. It also considers the interactions between humans and these systems. •Biological sciences Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms from microorganisms, animals to plants. •Health sciences The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.