{"title":"A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human–robot interaction","authors":"Reza Jalayer , Masoud Jalayer , Carlotta Orsenigo , Masayoshi Tomizuka","doi":"10.1016/j.rcim.2025.103110","DOIUrl":null,"url":null,"abstract":"<div><div>Hand-based analysis, including hand detection, segmentation, and gesture recognition, plays a pivotal role in enabling natural and intuitive human–robot interaction (HRI). Recent advances in vision-based deep learning (DL) have significantly improved robots’ ability to interpret hand cues across diverse settings. However, previous reviews have not addressed all three tasks collectively or focused on recent DL architectures. Filling this gap, we review recent studies at the intersection of DL and hand-based interaction in HRI. We structure the literature around three core tasks, i.e. hand detection, segmentation, and gesture recognition, highlighting DL models, dataset characteristics, evaluation metrics, and key challenges for each. We further examine the application of these models across industrial, assistive, social, aerial, and space robotics domains. We identify the dominant role of Convolutional and Recurrent Neural Networks (CNNs and RNNs), as well as emerging approaches such as attention-based models (Transformers), uncertainty-aware models, Graph Neural Networks (GNNs), and foundation models, i.e. Vision-Language Models (VLMs) and Large Language Models (LLMs). Our analysis reveals gaps, including the scarcity of HRI-specific datasets, underrepresentation of multi-hand and multi-user scenarios, limited use of RGBD and multi-modal inputs, weak cross-dataset generalization, and inconsistent real-time benchmarking. Dynamic and long-range gestures, multi-view setups, and context-aware understanding also remain relatively underexplored. 
Despite these limitations, promising directions have emerged, such as multi-modal fusion, use of foundation models for intent reasoning, and the development of lightweight architectures for deployment. This review offers a consolidated foundation to support future research on robust and context-aware DL systems for hand-centric HRI.</div></div>","PeriodicalId":21452,"journal":{"name":"Robotics and Computer-integrated Manufacturing","volume":"97 ","pages":"Article 103110"},"PeriodicalIF":11.4000,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Robotics and Computer-integrated Manufacturing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0736584525001644","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0
Abstract
Hand-based analysis, including hand detection, segmentation, and gesture recognition, plays a pivotal role in enabling natural and intuitive human–robot interaction (HRI). Recent advances in vision-based deep learning (DL) have significantly improved robots’ ability to interpret hand cues across diverse settings. However, previous reviews have neither addressed all three tasks collectively nor focused on recent DL architectures. Filling this gap, we review recent studies at the intersection of DL and hand-based interaction in HRI. We structure the literature around three core tasks, i.e., hand detection, segmentation, and gesture recognition, highlighting DL models, dataset characteristics, evaluation metrics, and key challenges for each. We further examine the application of these models across industrial, assistive, social, aerial, and space robotics domains. We identify the dominant role of Convolutional and Recurrent Neural Networks (CNNs and RNNs), as well as emerging approaches such as attention-based models (Transformers), uncertainty-aware models, Graph Neural Networks (GNNs), and foundation models, i.e., Vision-Language Models (VLMs) and Large Language Models (LLMs). Our analysis reveals gaps, including the scarcity of HRI-specific datasets, underrepresentation of multi-hand and multi-user scenarios, limited use of RGB-D and multi-modal inputs, weak cross-dataset generalization, and inconsistent real-time benchmarking. Dynamic and long-range gestures, multi-view setups, and context-aware understanding also remain relatively underexplored. Despite these limitations, promising directions have emerged, such as multi-modal fusion, the use of foundation models for intent reasoning, and the development of lightweight architectures for deployment. This review offers a consolidated foundation to support future research on robust and context-aware DL systems for hand-centric HRI.
Journal overview:
The journal, Robotics and Computer-Integrated Manufacturing, focuses on sharing research applications that contribute to the development of new or enhanced robotics, manufacturing technologies, and innovative manufacturing strategies that are relevant to industry. Papers that combine theory and experimental validation are preferred, while review papers on current robotics and manufacturing issues are also considered. However, papers on traditional machining processes, modeling and simulation, supply chain management, and resource optimization are generally not within the scope of the journal, as there are more appropriate journals for these topics. Similarly, papers that are overly theoretical or mathematical will be directed to other suitable journals. The journal welcomes original papers in areas such as industrial robotics, human-robot collaboration in manufacturing, cloud-based manufacturing, cyber-physical production systems, big data analytics in manufacturing, smart mechatronics, machine learning, adaptive and sustainable manufacturing, and other fields involving unique manufacturing technologies.