Vision language model-enhanced embodied intelligence for digital twin-assisted human-robot collaborative assembly

IF 10.4 · JCR Q1 (Tier 1, Computer Science) · Computer Science, Interdisciplinary Applications
Changchun Liu, Dunbing Tang, Haihua Zhu, Zequn Zhang, Liping Wang, Yi Zhang
Journal of Industrial Information Integration, Vol. 48, Article 100943. DOI: 10.1016/j.jii.2025.100943. Published 2025-09-02. Available at https://www.sciencedirect.com/science/article/pii/S2452414X25001669
Citations: 0

Abstract

Recently, embodied intelligence has emerged as a viable approach to achieving human-like perception, reasoning, decision-making, and execution capacities within human-robot collaborative (HRC) assembly contexts. Due to the lack of generalized enabling technologies and disconnections from physical control systems, embodied intelligence requires repetitive training of various functional models to operate in dynamic HRC scenarios, thereby struggling to adapt effectively to complex and evolving HRC environments. Hence, this study proposes a vision-language model (VLM)-enhanced embodied intelligence framework for digital twin (DT)-assisted human-robot collaborative assembly. Initially, the mapping between embodied agents and physical robots is established to achieve the encapsulation of embodied agents. Building upon the agent-based architecture, a VLM driven by both domain knowledge and real-time scenario data is constructed with sensory capabilities. Based on this, rapid recognition and response to dynamic HRC environments can be realized. Leveraging the strong generalization of VLMs, repetitive training of multiple perception models is circumvented. Furthermore, by utilizing the cognitive learning and intelligent reasoning capabilities of VLMs, an expert knowledge system for assembly processes is developed to provide task-oriented assistance and solution generation. To enhance the adaptability and generalization of complex HRC decision-making, VLMs support reinforcement learning through flexible configuration of HRC assembly state information processing, decision-action generation and guidance, and reward function design. In addition, a DT model of the HRC scenario is constructed to provide a simulation and deduction engine (i.e., embodied brain) for mitigating collision accidents. 
The decision results are then fed into the VLM as invocation parameters for corresponding sub-function code modules, generating complete collaborative robot action code to form the embodied neuron. Finally, compared with traditional decision methods (e.g., MA-A2C, DQN and GA) and VLM-enhanced MA-A2C, a series of comparative experiments conducted in a real-world HRC assembly scenario demonstrate that the proposed framework exhibits competitive advantages.
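The "embodied neuron" step described above (decision results fed into the VLM as invocation parameters for sub-function code modules, yielding complete robot action code) can be illustrated with a minimal sketch. All names here (`SUB_FUNCTIONS`, `move_to`, `pick`, `place`, the parameter sets) are hypothetical placeholders, and a simple template renderer stands in for the VLM code-generation call; the paper's actual module library and VLM interface are not specified in the abstract.

```python
# Hypothetical sketch: turn RL decision results into invocation parameters
# for sub-function code templates, then assemble them into a complete
# action script. The template rendering stands in for the VLM call.
from dataclasses import dataclass, field

# Illustrative library of sub-function code templates keyed by action name.
SUB_FUNCTIONS = {
    "move_to": "robot.move_to(x={x}, y={y}, z={z})",
    "pick": "robot.gripper.close(force={force})",
    "place": "robot.gripper.open()",
}

@dataclass
class Decision:
    """A single decision produced by the policy (illustrative)."""
    action: str
    params: dict = field(default_factory=dict)

def generate_action_code(decisions):
    """Render each decision into its sub-function template and join the
    fragments into one complete collaborative-robot action script."""
    lines = []
    for d in decisions:
        template = SUB_FUNCTIONS[d.action]  # look up the code module
        lines.append(template.format(**d.params))
    return "\n".join(lines)

if __name__ == "__main__":
    plan = [
        Decision("move_to", {"x": 0.4, "y": 0.1, "z": 0.2}),
        Decision("pick", {"force": 20}),
        Decision("place"),
    ]
    print(generate_action_code(plan))
```

In the proposed framework this assembly step would be performed by the VLM itself, with the decision results passed as prompt parameters; the table-lookup here only makes the data flow concrete.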
Source journal
Journal of Industrial Information Integration (Decision Sciences: Information Systems and Management)
CiteScore: 22.30 · Self-citation rate: 13.40% · Articles per year: 100
Journal description: The Journal of Industrial Information Integration focuses on industry's transition toward industrial integration and informatization, covering not only hardware and software but also information integration. It serves as an interdisciplinary forum for researchers, practitioners, and policy makers, promoting advances in industrial information integration and addressing its challenges, issues, and solutions. The journal welcomes papers on foundational, technical, and practical aspects of industrial information integration, emphasizing the complex, cross-disciplinary topics that arise in industrial integration. Techniques from mathematical science, computer science, computer engineering, electrical and electronic engineering, manufacturing engineering, and engineering management are central in this context.