Low-cost computation for isolated sign language video recognition with multiple reservoir computing.

IF 2.6 Zone 3 (Multidisciplinary Journals) Q1 MULTIDISCIPLINARY SCIENCES

PLoS ONE Pub Date : 2025-07-30 eCollection Date: 2025-01-01 DOI:10.1371/journal.pone.0322717

A R Syulistyo, Y Tanaka, D Pramanta, N Fuengfusin, H Tamukoh

PLoS ONE, Volume 20, Issue 7, e0322717. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12310007/pdf/

Citations: 0

Abstract


Low-cost computation for isolated sign language video recognition with multiple reservoir computing.

Sign language recognition (SLR) has the potential to bridge communication gaps and empower hearing-impaired communities. To keep an SLR system portable and accessible, it needs to run on a portable, server-independent device. Such a device can be used in areas without internet connectivity and also addresses the need for data privacy protection. Although deep neural network models are powerful, their use on edge devices is limited by computational constraints. This study investigates reservoir computing (RC), which is known for its edge-friendly characteristics. By leveraging RC, we aim to build a cost-effective SLR system optimized for edge devices with limited resources. To enhance the recognition capabilities of RC, we introduce multiple reservoirs with distinct leak rates, which extract diverse features from the input videos. Before feeding sign language videos into the RC, we preprocess them with MediaPipe. This step extracts the coordinates of the signer's body and hand locations, referred to as keypoints, and normalizes their spatial positions. Combining keypoint extraction via MediaPipe with normalization during preprocessing makes the SLR system more robust to complex backgrounds and varying signer positions. Experimental results demonstrate that the integration of MediaPipe and multiple reservoirs yields competitive results compared with deep recurrent neural networks and echo state networks, with significantly lower training times. Our proposed multiple reservoir computing (MRC) approach achieved top-1, top-5, and top-10 accuracies of 60.35%, 84.65%, and 91.51%, respectively, on the WLASL100 dataset, outperforming the deep learning-based approaches Pose-TGCN and Pose-GRU. Furthermore, owing to the characteristics of RC, the training time was shortened to 52.7 s, compared with 20 h for I3D, while maintaining a competitive inference time.
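The idea of several reservoirs with distinct leak rates can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give reservoir sizes, weight initialization, how states are pooled over time, or the readout, so the time-averaged state, the ridge-regression readout, and every function and parameter name below are assumptions chosen for illustration. The input is assumed to be a sequence of already extracted and normalized keypoint vectors of shape (T, n_in).

```python
import numpy as np


def make_reservoir(n_in, n_res, spectral_radius=0.9, input_scale=0.5, rng=None):
    """Create fixed random weights for one reservoir (hypothetical defaults)."""
    rng = np.random.default_rng() if rng is None else rng
    w_in = rng.uniform(-input_scale, input_scale, size=(n_res, n_in))
    w = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
    # Rescale so the largest absolute eigenvalue equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w


def run_reservoir(seq, w_in, w, leak_rate):
    """Leaky-integrator state update over a (T, n_in) sequence.

    Returns the time-averaged state (an assumed pooling choice).
    """
    x = np.zeros(w.shape[0])
    mean_state = np.zeros_like(x)
    for u in seq:
        x = (1.0 - leak_rate) * x + leak_rate * np.tanh(w_in @ u + w @ x)
        mean_state += x
    return mean_state / len(seq)


def extract_features(seq, reservoirs, leak_rates):
    """Concatenate features from reservoirs that differ in leak rate."""
    return np.concatenate(
        [run_reservoir(seq, w_in, w, a) for (w_in, w), a in zip(reservoirs, leak_rates)]
    )


def fit_readout(features, labels, n_classes, ridge=1e-2):
    """Closed-form ridge regression onto one-hot targets (assumed readout)."""
    targets = np.eye(n_classes)[labels]
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)


def predict(features, w_out):
    """Pick the class with the largest readout score for each sample."""
    return np.argmax(features @ w_out, axis=1)
```

Only the readout weights are fitted, and in closed form, which is the general reason RC training tends to be cheap. A small leak rate makes a reservoir respond slowly and retain longer context, while a large one tracks fast changes, so concatenating both gives the readout features at several time scales.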


Source journal

PLoS ONE (Biology)

CiteScore: 6.20
Self-citation rate: 5.40%
Articles published: 14242
Review time: 3.7 months

Journal description: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage