Lili Dong, Tianliang Hu, Tianyi Sun, Junrui Li, Songhua Ma
{"title":"人机协同制造中人机动作识别的RGB视频与惯性传感融合方法","authors":"Lili Dong , Tianliang Hu , Tianyi Sun , Junrui Li , Songhua Ma","doi":"10.1016/j.jmsy.2025.09.007","DOIUrl":null,"url":null,"abstract":"<div><div>Human action recognition (HAR), as a prerequisite for robotic dynamic decision-making, is crucial for achieving efficient human-robot collaborative manufacturing (HRCM). Compared with single modality, multi-modality provides a more comprehensive understanding of human actions. However, it is a challenge to effectively integrate this information to fully leverage the advantages of multi-modality for HAR in HRCM. Therefore, in this paper, the RGB video and inertial sensing fusion method for HAR in HRCM is proposed, presenting the systematic exploration of this multi-modality in industrial contexts. Two fusion strategies of two modalities are studied: decision-level fusion and feature-level fusion. Secondly, taking the rotary vector (RV) reducer assembly as an example, a multi-modal human assembly action dataset for HAR (HAAD-SDU) is designed, filling the gap in the HRCM field where publicly representative datasets are scarce. This dataset synchronously introduces RGB video and inertial sensing data containing human assembly information. Finally, the feasibility and effectiveness of the proposed approach are verified by the designed dataset and public dataset, demonstrating superior performance over baseline methods. The experimental results demonstrate that the proposed fusion approach integrating RGB video and inertial sensing modalities not only overcomes the limitations of the single modality but also exhibits strong cross-domain generalizability, proving effective for both industrial tasks and daily activity recognition. In the HRCM scenario specifically, both decision-level and feature-level fusion strategies demonstrate superior recognition capabilities. The decision-level fusion provides a higher recognition accuracy of 95.71 %, while the feature-level fusion achieves competitive accuracy at 94.42 % with low recognition latency of 1.67 s. Notably, the proposed fusion model can accurately recognize human behaviors at least 2 s before they are completed, providing sufficient leftover time for the robotic system to complete collaborative tasks.</div></div>","PeriodicalId":16227,"journal":{"name":"Journal of Manufacturing Systems","volume":"83 ","pages":"Pages 216-234"},"PeriodicalIF":14.2000,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RGB video and inertial sensing fusion method for human action recognition in human-robot collaborative manufacturing\",\"authors\":\"Lili Dong , Tianliang Hu , Tianyi Sun , Junrui Li , Songhua Ma\",\"doi\":\"10.1016/j.jmsy.2025.09.007\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Human action recognition (HAR), as a prerequisite for robotic dynamic decision-making, is crucial for achieving efficient human-robot collaborative manufacturing (HRCM). Compared with single modality, multi-modality provides a more comprehensive understanding of human actions. However, it is a challenge to effectively integrate this information to fully leverage the advantages of multi-modality for HAR in HRCM. Therefore, in this paper, the RGB video and inertial sensing fusion method for HAR in HRCM is proposed, presenting the systematic exploration of this multi-modality in industrial contexts. 
Two fusion strategies of two modalities are studied: decision-level fusion and feature-level fusion. Secondly, taking the rotary vector (RV) reducer assembly as an example, a multi-modal human assembly action dataset for HAR (HAAD-SDU) is designed, filling the gap in the HRCM field where publicly representative datasets are scarce. This dataset synchronously introduces RGB video and inertial sensing data containing human assembly information. Finally, the feasibility and effectiveness of the proposed approach are verified by the designed dataset and public dataset, demonstrating superior performance over baseline methods. The experimental results demonstrate that the proposed fusion approach integrating RGB video and inertial sensing modalities not only overcomes the limitations of the single modality but also exhibits strong cross-domain generalizability, proving effective for both industrial tasks and daily activity recognition. In the HRCM scenario specifically, both decision-level and feature-level fusion strategies demonstrate superior recognition capabilities. The decision-level fusion provides a higher recognition accuracy of 95.71 %, while the feature-level fusion achieves competitive accuracy at 94.42 % with low recognition latency of 1.67 s. Notably, the proposed fusion model can accurately recognize human behaviors at least 2 s before they are completed, providing sufficient leftover time for the robotic system to complete collaborative tasks.</div></div>\",\"PeriodicalId\":16227,\"journal\":{\"name\":\"Journal of Manufacturing Systems\",\"volume\":\"83 \",\"pages\":\"Pages 216-234\"},\"PeriodicalIF\":14.2000,\"publicationDate\":\"2025-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Manufacturing Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0278612525002341\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, INDUSTRIAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Manufacturing Systems","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0278612525002341","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, INDUSTRIAL","Score":null,"Total":0}
RGB video and inertial sensing fusion method for human action recognition in human-robot collaborative manufacturing
Human action recognition (HAR), as a prerequisite for robotic dynamic decision-making, is crucial for achieving efficient human-robot collaborative manufacturing (HRCM). Compared with a single modality, multi-modality provides a more comprehensive understanding of human actions; however, effectively integrating multi-modal information to fully exploit its advantages for HAR in HRCM remains a challenge. Therefore, this paper proposes an RGB video and inertial sensing fusion method for HAR in HRCM, presenting a systematic exploration of this multi-modal combination in industrial contexts. First, two fusion strategies for the two modalities are studied: decision-level fusion and feature-level fusion. Second, taking rotary vector (RV) reducer assembly as an example, a multi-modal human assembly action dataset for HAR (HAAD-SDU) is constructed, addressing the scarcity of publicly available, representative datasets in the HRCM field; the dataset provides synchronized RGB video and inertial sensing data containing human assembly information. Finally, the feasibility and effectiveness of the proposed approach are verified on both the constructed dataset and a public dataset, demonstrating superior performance over baseline methods. The experimental results show that the proposed fusion approach, integrating RGB video and inertial sensing modalities, not only overcomes the limitations of a single modality but also exhibits strong cross-domain generalizability, proving effective for both industrial tasks and daily activity recognition. In the HRCM scenario specifically, both fusion strategies demonstrate superior recognition capability: decision-level fusion achieves the higher recognition accuracy of 95.71 %, while feature-level fusion achieves a competitive 94.42 % with a low recognition latency of 1.67 s. Notably, the proposed fusion model can accurately recognize human actions at least 2 s before they are completed, leaving the robotic system sufficient time to carry out its collaborative tasks.
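To make the two fusion strategies named in the abstract concrete, the minimal PyTorch sketch below contrasts them in generic form. It is an illustrative assumption, not the authors' architecture: the embedding sizes (512 for RGB video, 128 for inertial data), the class count, the hidden layer, and the weighted-average rule in the decision-level head are all placeholders introduced here for the example.

```python
# Illustrative sketch only; module names and dimensions are assumptions, not the paper's model.
import torch
import torch.nn as nn


class DecisionLevelFusion(nn.Module):
    """Each modality is classified independently; per-class scores are then combined."""

    def __init__(self, video_dim, imu_dim, num_classes, video_weight=0.5):
        super().__init__()
        self.video_head = nn.Linear(video_dim, num_classes)
        self.imu_head = nn.Linear(imu_dim, num_classes)
        self.video_weight = video_weight  # assumed weighting scheme, not taken from the paper

    def forward(self, video_feat, imu_feat):
        p_video = torch.softmax(self.video_head(video_feat), dim=-1)
        p_imu = torch.softmax(self.imu_head(imu_feat), dim=-1)
        # Weighted average of the two modality-specific class-probability distributions.
        return self.video_weight * p_video + (1.0 - self.video_weight) * p_imu


class FeatureLevelFusion(nn.Module):
    """Modality features are concatenated and classified by a single shared head."""

    def __init__(self, video_dim, imu_dim, num_classes, hidden_dim=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(video_dim + imu_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, video_feat, imu_feat):
        fused = torch.cat([video_feat, imu_feat], dim=-1)
        return self.classifier(fused)  # unnormalized class scores (logits)


if __name__ == "__main__":
    video_feat = torch.randn(4, 512)  # placeholder RGB-video embeddings (batch of 4)
    imu_feat = torch.randn(4, 128)    # placeholder inertial-sensing embeddings
    print(DecisionLevelFusion(512, 128, num_classes=10)(video_feat, imu_feat).shape)
    print(FeatureLevelFusion(512, 128, num_classes=10)(video_feat, imu_feat).shape)
```

The design difference is that decision-level fusion keeps each modality's classifier independent and merges only their output scores, whereas feature-level fusion merges the embeddings first and trains one shared classifier on the joint representation.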
About the journal:
The Journal of Manufacturing Systems is dedicated to showcasing cutting-edge fundamental and applied research in manufacturing at the systems level. Encompassing products, equipment, people, information, control, and support functions, manufacturing systems play a pivotal role in the economical and competitive development, production, delivery, and total lifecycle of products, meeting market and societal needs.
With a commitment to publishing archival scholarly literature, the journal strives to advance the state of the art in manufacturing systems and foster innovation in crafting efficient, robust, and sustainable manufacturing systems. The focus extends from equipment-level considerations to the broader scope of the extended enterprise. The Journal welcomes research addressing challenges across various scales, including nano, micro, and macro-scale manufacturing, and spanning diverse sectors such as aerospace, automotive, energy, and medical device manufacturing.