Changchun Liu, Dunbing Tang, Haihua Zhu, Zequn Zhang, Liping Wang, Yi Zhang
Title: Vision language model-enhanced embodied intelligence for digital twin-assisted human-robot collaborative assembly
Journal: Journal of Industrial Information Integration, Volume 48, Article 100943
Publication date: 2025-09-02
DOI: 10.1016/j.jii.2025.100943
URL: https://www.sciencedirect.com/science/article/pii/S2452414X25001669
Impact factor: 10.4 (JCR Q1, Computer Science, Interdisciplinary Applications)
Citations: 0
Abstract
Recently, embodied intelligence has emerged as a viable approach to achieving human-like perception, reasoning, decision-making, and execution capacities within human-robot collaborative (HRC) assembly contexts. However, because it lacks generalized enabling technologies and is disconnected from physical control systems, embodied intelligence requires repetitive training of various functional models to operate in dynamic HRC scenarios, and thus struggles to adapt effectively to complex and evolving HRC environments. Hence, this study proposes a vision-language model (VLM)-enhanced embodied intelligence framework for digital twin (DT)-assisted human-robot collaborative assembly. First, the mapping between embodied agents and physical robots is established to encapsulate the embodied agents. Building upon this agent-based architecture, a VLM with sensory capabilities, driven by both domain knowledge and real-time scenario data, is constructed. On this basis, dynamic HRC environments can be recognized and responded to rapidly. Because VLMs generalize strongly, repetitive training of multiple perception models is avoided. Furthermore, by utilizing the cognitive learning and intelligent reasoning capabilities of VLMs, an expert knowledge system for assembly processes is developed to provide task-oriented assistance and solution generation. To enhance the adaptability and generalization of complex HRC decision-making, VLMs support reinforcement learning through flexible configuration of HRC assembly state information processing, decision-action generation and guidance, and reward function design. In addition, a DT model of the HRC scenario is constructed to provide a simulation and deduction engine (i.e., the embodied brain) for mitigating collision accidents. The decision results are then fed into the VLM as invocation parameters for the corresponding sub-function code modules, generating complete collaborative robot action code to form the embodied neuron. Finally, a series of comparative experiments conducted in a real-world HRC assembly scenario, against traditional decision methods (e.g., MA-A2C, DQN, and GA) and VLM-enhanced MA-A2C, demonstrate that the proposed framework exhibits competitive advantages.
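The last steps of the pipeline (a simulated safety check, then decision results used as invocation parameters for sub-function code modules) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the module names (`move_to`, `grasp`, `release`), the `Decision` structure, the emitted `robot.*` strings, and the distance-based `is_safe` check standing in for the DT simulation are all hypothetical, chosen only to show the general shape of such a dispatch step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Decision:
    """One decision result: which sub-function to invoke, and with what parameters."""
    module: str
    params: Dict[str, object]


# Hypothetical sub-function code modules; each returns one line of robot action code.
def move_to(target: Point, speed: float = 0.1) -> str:
    x, y, z = target
    return f"robot.move_to({x:.3f}, {y:.3f}, {z:.3f}, speed={speed:.2f})"


def grasp(part_id: str) -> str:
    return f"robot.grasp('{part_id}')"


def release() -> str:
    return "robot.release()"


MODULES: Dict[str, Callable[..., str]] = {
    "move_to": move_to,
    "grasp": grasp,
    "release": release,
}


def is_safe(target: Point, human: Point, min_dist: float = 0.30) -> bool:
    """Assumed stand-in for the DT check: reject targets too close to the human."""
    dist = sum((a - b) ** 2 for a, b in zip(target, human)) ** 0.5
    return dist >= min_dist


def generate_action_code(decisions: List[Decision], human: Point) -> List[str]:
    """Turn decision results into action code, stopping at the first unsafe motion."""
    lines: List[str] = []
    for d in decisions:
        if d.module not in MODULES:
            raise KeyError(f"unknown sub-function module: {d.module}")
        if d.module == "move_to" and not is_safe(d.params["target"], human):
            lines.append("robot.wait()")  # hold position instead of moving
            break
        lines.append(MODULES[d.module](**d.params))
    return lines


plan = [
    Decision("move_to", {"target": (0.5, 0.2, 0.1)}),
    Decision("grasp", {"part_id": "gear_01"}),
    Decision("release", {}),
]
for line in generate_action_code(plan, human=(1.5, 0.0, 0.1)):
    print(line)
```

Keeping the modules in a registry keyed by name means a decision-maker only has to emit a module name and its parameters, while the safety gate sits in one place before any motion command is generated.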
About the journal:
The Journal of Industrial Information Integration focuses on the industry's transition towards industrial integration and informatization, covering not only hardware and software but also information integration. It serves as a platform for promoting advances in industrial information integration, addressing challenges, issues, and solutions in an interdisciplinary forum for researchers, practitioners, and policy makers.
The Journal of Industrial Information Integration welcomes papers on foundational, technical, and practical aspects of industrial information integration, emphasizing the complex and cross-disciplinary topics that arise in industrial integration. Techniques from mathematical science, computer science, computer engineering, electrical and electronic engineering, manufacturing engineering, and engineering management are crucial in this context.