Towards zero-shot robot tool manipulation in industrial context: A modular VLM framework enhanced by multimodal affordance representation

IF 11.4 · Zone 1, Computer Science · Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Robotics and Computer-integrated Manufacturing Pub Date : 2025-10-07 DOI:10.1016/j.rcim.2025.103161

Qi Zhou, Yuwei Gu, Jiawen Li, Bohan Feng, Boyan Li, Youyi Bi

Citations: 0

Abstract

Robot tool manipulation in industrial contexts requires precise spatial localization, stable force control, and versatile adaptability across diverse tools and tasks. Traditional robot manipulation methods usually struggle to generalize to unseen scenarios and to maintain reliable, precise interactions under complex physical constraints. Recent Vision Language Model (VLM)-based approaches demonstrate better generalization ability, but they often lack fine-grained modeling of the spatial and force constraints that are critical for real-world industrial applications. To address these challenges, we propose a novel framework for zero-shot robot tool manipulation in industrial environments, named ToolManip. This framework adopts a modular, multi-agent VLM architecture. It decomposes the manipulation process into four specialized modules (task understanding and planning, affordance reasoning, primitive reasoning, and execution monitoring), each handled by a dedicated VLM agent. To enhance manipulation accuracy, we develop a multimodal affordance representation method that models spatial and force constraints separately. Spatial constraints are encoded via hierarchical region extraction and structured interaction fields to define keypoints and interaction directions, while force constraints are represented through force control primitive reasoning to enable precise and compliant motion and force planning. Additionally, an integrated execution-monitoring pipeline improves system reliability by tracking the status of each task step and performing stepwise corrections. Experimental results demonstrate that ToolManip achieves robust, generalizable, and high-accuracy performance in various constraint-rich industrial tool manipulation tasks. Our work contributes to the development of advanced robotic manipulation methods for industry and smart manufacturing environments empowered by generative artificial intelligence.
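As a rough illustration of the four-module decomposition and the stepwise correction loop described in the abstract, the following minimal Python sketch may help. It is not the authors' implementation: every name here (`SpatialAffordance`, `ForcePrimitive`, `Step`, `run_pipeline`, the stub agents, and the `max_retries` parameter) is hypothetical, and the VLM agents are replaced by plain callables so the control flow can be run without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class SpatialAffordance:
    """Spatial constraint (hypothetical encoding): a keypoint and a direction."""
    keypoint: Vec3
    direction: Vec3

@dataclass
class ForcePrimitive:
    """Force constraint (hypothetical encoding): a named primitive and a target force."""
    name: str
    target_force_n: float

@dataclass
class Step:
    description: str
    spatial: Optional[SpatialAffordance] = None
    force: Optional[ForcePrimitive] = None

def run_pipeline(
    instruction: str,
    planner: Callable[[str], List[str]],
    affordance_agent: Callable[[Step], SpatialAffordance],
    primitive_agent: Callable[[Step], ForcePrimitive],
    execute: Callable[[Step], None],
    monitor: Callable[[Step], bool],
    max_retries: int = 2,  # assumed retry budget, not from the paper
):
    """Plan, then for each step reason about affordance and force, execute, and check.

    Returns (success, log), where log holds (description, attempt, ok) tuples.
    """
    log = []
    for description in planner(instruction):
        step = Step(description)
        for attempt in range(max_retries + 1):
            # Spatial and force constraints are modeled separately.
            step.spatial = affordance_agent(step)
            step.force = primitive_agent(step)
            execute(step)
            ok = monitor(step)
            log.append((description, attempt, ok))
            if ok:
                break  # step verified, move on to the next one
        else:
            return False, log  # retries exhausted for this step
    return True, log

def demo():
    """Run the loop with stub agents; the second step fails once, then passes."""
    seen = {}

    def planner(instruction):
        return ["grasp wrench", "tighten bolt"]

    def affordance_agent(step):
        return SpatialAffordance((0.1, 0.0, 0.2), (0.0, 0.0, -1.0))

    def primitive_agent(step):
        return ForcePrimitive("press", 5.0)

    def execute(step):
        pass  # a real system would command the robot here

    def monitor(step):
        seen[step.description] = seen.get(step.description, 0) + 1
        return not (step.description == "tighten bolt" and seen[step.description] == 1)

    return run_pipeline("tighten the bolt with the wrench", planner,
                        affordance_agent, primitive_agent, execute, monitor)
```

Calling `demo()` exercises the retry path: the monitor rejects the first attempt at the second step, the affordance and primitive reasoning are rerun, and the step is then accepted. Passing the agents in as callables is only a convenience for this sketch; it keeps the monitoring loop independent of whichever model backs each module.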

Source journal

Robotics and Computer-integrated Manufacturing (Engineering and Technology - Engineering: Manufacturing)

CiteScore

24.10

Self-citation rate

13.50%

Articles published

160

Review time

50 days

Journal description: The journal, Robotics and Computer-Integrated Manufacturing, focuses on sharing research applications that contribute to the development of new or enhanced robotics, manufacturing technologies, and innovative manufacturing strategies that are relevant to industry. Papers that combine theory and experimental validation are preferred, while review papers on current robotics and manufacturing issues are also considered. However, papers on traditional machining processes, modeling and simulation, supply chain management, and resource optimization are generally not within the scope of the journal, as there are more appropriate journals for these topics. Similarly, papers that are overly theoretical or mathematical will be directed to other suitable journals. The journal welcomes original papers in areas such as industrial robotics, human-robot collaboration in manufacturing, cloud-based manufacturing, cyber-physical production systems, big data analytics in manufacturing, smart mechatronics, machine learning, adaptive and sustainable manufacturing, and other fields involving unique manufacturing technologies.