Spatial–temporal attention with graph and general neural network-based sign language recognition

IF 3.7, Zone 4 (Computer Science), JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Pattern Analysis and Applications Pub Date : 2024-04-04 DOI:10.1007/s10044-024-01229-4

Abu Saleh Musa Miah, Md. Al Mehedi Hasan, Yuichi Okuyama, Yoichi Tomioka, Jungpil Shin


Citations: 0

Abstract

Automatic sign language recognition (SLR) is an important problem in human–computer interaction and computer vision: it converts the hand signs used by people with significant hearing and speech impairments into equivalent text or voice. Researchers have recently used hand skeleton joint information instead of image pixels, because pixel-based methods are sensitive to illumination and complex backgrounds. However, besides hand information, body motion and facial gestures play an essential role in expressing emotion in sign language. A few researchers have developed SLR systems on multi-gesture datasets, but their accuracy and time complexity are not yet satisfactory. In light of these limitations, we introduce a spatial and temporal attention model combined with a general neural network for SLR. The main idea of our architecture is to first construct a fully connected graph onto which the skeleton information is projected. We then employ self-attention mechanisms to extract node and edge features across the spatial and temporal domains. The architecture has three branches: a graph-based spatial branch, a graph-based temporal branch, and a general neural network branch, whose outputs are combined in the final feature integration. Specifically, the spatial branch captures spatial dependencies, while the temporal branch enhances the temporal dependencies embedded in the sequential hand skeleton data. The general neural network branch improves the architecture's generalization and robustness. We evaluated the model on the Mexican Sign Language (MSL) and Pakistani Sign Language (PSL) datasets and on the American Sign Language Lexicon Video Dataset (ASLLVD), which comprises 3D joint coordinates for the face, body, and hands, and we conducted experiments on individual gestures and their combinations. Our model achieved an accuracy of 99.96% on the MSL dataset, 92.00% on PSL, and 26.00% on ASLLVD, which includes more than 2700 classes. These results, together with the model's computational efficiency, indicate that it compares favorably with contemporary methods in the field.
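The three-branch idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy dimensions, the identity query/key/value projections, the mean pooling, and the fusion by concatenation are all assumptions chosen to keep the example short. A skeleton sequence is represented as a nested list indexed by frame, joint, and coordinate. Because every joint attends to every other joint (and every frame to every other frame), the attention graph is fully connected.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (an assumption).

    tokens: list of N feature vectors of length C. Each token attends to all
    tokens, which corresponds to a fully connected graph over the tokens.
    """
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(len(q))])
    return out


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]


def spatial_branch(seq):
    """Attend across joints within each frame, then pool over joints and frames."""
    return mean_pool([mean_pool(self_attention(frame)) for frame in seq])


def temporal_branch(seq):
    """Attend across frames for each joint trajectory, then pool."""
    per_joint = []
    for j in range(len(seq[0])):
        track = [frame[j] for frame in seq]
        per_joint.append(mean_pool(self_attention(track)))
    return mean_pool(per_joint)


def general_branch(seq, weights):
    """A plain dense layer with ReLU on the flattened sequence."""
    flat = [v for frame in seq for joint in frame for v in joint]
    return [max(0.0, sum(w * x for w, x in zip(row, flat))) for row in weights]


def fuse(seq, weights):
    """Concatenate the three branch outputs into one feature vector."""
    return spatial_branch(seq) + temporal_branch(seq) + general_branch(seq, weights)


# Toy data: T frames, J joints, C coordinates per joint, H dense units.
random.seed(0)
T, J, C, H = 4, 5, 3, 2
seq = [[[random.uniform(-1, 1) for _ in range(C)] for _ in range(J)]
       for _ in range(T)]
weights = [[random.uniform(-0.1, 0.1) for _ in range(T * J * C)]
           for _ in range(H)]
features = fuse(seq, weights)
print(len(features))  # C + C + H = 8
```

In a trained model the attention layers would use learned projections and the fused vector would feed a classifier over the sign classes; the sketch only shows how spatial attention (over joints), temporal attention (over frames), and a generic dense branch can operate on the same skeleton sequence and be combined.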


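Source journal

Pattern Analysis and Applications (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore

7.40

Self-citation rate

2.60%

Number of published articles

Review time

13.5 months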

Journal description: The journal publishes high-quality articles in areas of fundamental research in intelligent pattern analysis and applications in computer science and engineering. It aims to provide a forum for original research which describes novel pattern analysis techniques and industrial applications of the current technology. In addition, the journal will also publish articles on pattern analysis applications in medical imaging. The journal solicits articles that detail new technology and methods for pattern recognition and analysis in applied domains including, but not limited to, computer vision and image processing, speech analysis, robotics, multimedia, document analysis, character recognition, knowledge engineering for pattern recognition, fractal analysis, and intelligent control. The journal publishes articles on the use of advanced pattern recognition and analysis methods, including statistical techniques, neural networks, genetic algorithms, fuzzy pattern recognition, machine learning, and hardware implementations, which are either relevant to the development of pattern analysis as a research area or detail novel pattern analysis applications. Papers proposing new classifier systems or their development, pattern analysis systems for real-time applications, fuzzy and temporal pattern recognition, and uncertainty management in applied pattern recognition are particularly solicited.