Towards zero-shot robot tool manipulation in industrial context: A modular VLM framework enhanced by multimodal affordance representation
Authors: Qi Zhou, Yuwei Gu, Jiawen Li, Bohan Feng, Boyan Li, Youyi Bi
Journal: Robotics and Computer-Integrated Manufacturing, vol. 115 (impact factor 11.4; JCR Q1, Computer Science, Interdisciplinary Applications)
Published: 2025-10-07
DOI: https://doi.org/10.1016/j.rcim.2025.103161
Citations: 0
Abstract
Robot tool manipulation in industrial contexts requires precise spatial localization, stable force control, and adaptability across diverse tools and tasks. Traditional robot manipulation methods usually struggle to generalize to unseen scenarios and to maintain reliable, precise interactions under complex physical constraints. Recent Vision Language Model (VLM)-based approaches generalize better, but they often lack the fine-grained modeling of spatial and force constraints that real-world industrial applications require. To address these challenges, we propose ToolManip, a novel framework for zero-shot robot tool manipulation in industrial environments. The framework adopts a modular, multi-agent VLM architecture. It decomposes the manipulation process into four specialized modules (task understanding and planning, affordance reasoning, primitive reasoning, and execution monitoring), each handled by a dedicated VLM agent. To enhance manipulation accuracy, we develop a multimodal affordance representation method that models spatial and force constraints separately. Spatial constraints are encoded via hierarchical region extraction and structured interaction fields, which define keypoints and interaction directions; force constraints are represented through force control primitive reasoning, which enables precise and compliant motion and force planning. Additionally, an integrated execution-monitoring pipeline improves system reliability by tracking the status of each task step and performing stepwise corrections. Experimental results demonstrate that ToolManip achieves robust, generalizable, and high-accuracy performance in various constraint-rich industrial tool manipulation tasks. Our work contributes to the development of advanced robotic manipulation methods for industry and smart manufacturing environments empowered by generative artificial intelligence.
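The four-module decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not specify any interfaces, so every name below (`plan_task`, `reason_affordance`, `reason_primitive`, `run`, the dataclasses, and the retry limit) is hypothetical, and the VLM agents are replaced by fixed stubs. It is not the authors' implementation; it only shows how spatial and force constraints might be kept as separate records and how a monitor could retry a failed step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Affordance:
    """Spatial constraint (hypothetical): a keypoint and an interaction direction."""
    keypoint: Vec3
    direction: Vec3


@dataclass
class ForcePrimitive:
    """Force constraint (hypothetical): a named primitive with a target force in newtons."""
    name: str
    target_force_n: float


@dataclass
class StepPlan:
    description: str
    affordance: Affordance
    primitive: ForcePrimitive


def plan_task(instruction: str) -> List[str]:
    # Stub for the task understanding and planning agent:
    # here the instruction is simply split on commas.
    return [s.strip() for s in instruction.split(",") if s.strip()]


def reason_affordance(step: str) -> Affordance:
    # Stub for the affordance reasoning agent: returns a fixed placeholder.
    return Affordance(keypoint=(0.0, 0.0, 0.0), direction=(0.0, 0.0, -1.0))


def reason_primitive(step: str, affordance: Affordance) -> ForcePrimitive:
    # Stub for the primitive reasoning agent: returns a fixed placeholder.
    return ForcePrimitive(name="press", target_force_n=5.0)


def run(
    instruction: str,
    execute: Callable[[StepPlan, int], bool],
    max_retries: int = 2,
) -> List[Tuple[str, bool]]:
    """Execute each step, re-reasoning and retrying when the monitor reports failure."""
    log: List[Tuple[str, bool]] = []
    for step in plan_task(instruction):
        ok = False
        for attempt in range(max_retries + 1):
            affordance = reason_affordance(step)
            primitive = reason_primitive(step, affordance)
            # `execute` stands in for robot execution plus the monitoring agent's verdict.
            ok = execute(StepPlan(step, affordance, primitive), attempt)
            if ok:
                break
        log.append((step, ok))
        if not ok:
            break  # stop the task once a step cannot be corrected
    return log
```

In this sketch the caller supplies `execute`, so the loop can be exercised without a robot, for example with a function that fails the first attempt of one step and succeeds on the retry.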
About the journal:
The journal, Robotics and Computer-Integrated Manufacturing, focuses on sharing research applications that contribute to the development of new or enhanced robotics, manufacturing technologies, and innovative manufacturing strategies that are relevant to industry. Papers that combine theory and experimental validation are preferred, while review papers on current robotics and manufacturing issues are also considered. However, papers on traditional machining processes, modeling and simulation, supply chain management, and resource optimization are generally not within the scope of the journal, as there are more appropriate journals for these topics. Similarly, papers that are overly theoretical or mathematical will be directed to other suitable journals. The journal welcomes original papers in areas such as industrial robotics, human-robot collaboration in manufacturing, cloud-based manufacturing, cyber-physical production systems, big data analytics in manufacturing, smart mechatronics, machine learning, adaptive and sustainable manufacturing, and other fields involving unique manufacturing technologies.