VTLG: A vision-tactile-language grasp generation method oriented towards task

IF 11.4 · CAS Tier 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Tong Li, Chengshun Yu, Yuhang Yan, Di Song, Yuxin Shuai, Yifan Wang, Gang Chen
{"title":"面向任务的视觉-触觉-语言掌握生成方法","authors":"Tong Li,&nbsp;Chengshun Yu,&nbsp;Yuhang Yan,&nbsp;Di Song,&nbsp;Yuxin Shuai,&nbsp;Yifan Wang,&nbsp;Gang Chen","doi":"10.1016/j.rcim.2025.103152","DOIUrl":null,"url":null,"abstract":"<div><div>Preceded task-oriented grasp is indispensable for achieving reliable robotic manipulation. Existing task-oriented grasp generation methods typically rely on visual information of the target, deployed on simple parallel-jaw grippers, which often struggle with visual degradation and inadequate grasp reliability and flexibility. In this paper, we propose a task-oriented grasp generation system that integrates external visual and tactile perception on the guidance of textual description. Multimodal encoding consists of two stages: the visual-tactile feature fusion encoding for robot grasp and object spatial perception, and the textual normalization and encoding, followed by spatial perception-semantic feature fusion. We conceive to introduce tactile perception in the pre-contact phase. A visual-tactile fusion method is proposed to combine single-view point cloud with tactile array of pre-contact, partitioning the object surface into multiple contact patches. The vision transformer architecture is employed to gain a spatial representation of the object surface, encoding globally spatial features and implicitly assisting in evaluating the feasibility of grasps across regions by shape learning. To improve the inference effectiveness of the large language model under varying task context, we introduce specifications for textual standardization and migratory comprehension for unknown concepts. The language encoder and principal component analysis are used to encode the given standardized text that follows the above textual generation paradigm. A spatial-semantic feature fusion method is then proposed based on window shift and cross-attention to realize the alignment between task context and object spatial features, ulteriorly preference-based attention on graspable regions. Finally, we present a grasp parameter prediction module based on diffusion model specialized for high-dimensional conditions and generalized hand proprioceptive space, which generates grasp by predicting noise. Experimental results demonstrate that the proposed method outperforms baseline methods requiring the complete object shape, since the mean average precision metrics reaches 72.13%, with an improvement of 1.65%. Each module exhibits performance improvement over conventional methods. Ablation study indicates that the introduction of tactile and text modality improves the metrics by over 3% and 14%. The single-shot success rate for the predicted grasps exceeds 65% on real-world experiments, underscoring the reliability of the proposed system.</div></div>","PeriodicalId":21452,"journal":{"name":"Robotics and Computer-integrated Manufacturing","volume":"98 ","pages":"Article 103152"},"PeriodicalIF":11.4000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VTLG: A vision-tactile-language grasp generation method oriented towards task\",\"authors\":\"Tong Li,&nbsp;Chengshun Yu,&nbsp;Yuhang Yan,&nbsp;Di Song,&nbsp;Yuxin Shuai,&nbsp;Yifan Wang,&nbsp;Gang Chen\",\"doi\":\"10.1016/j.rcim.2025.103152\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Preceded task-oriented grasp is indispensable for achieving reliable robotic manipulation. 
Existing task-oriented grasp generation methods typically rely on visual information of the target, deployed on simple parallel-jaw grippers, which often struggle with visual degradation and inadequate grasp reliability and flexibility. In this paper, we propose a task-oriented grasp generation system that integrates external visual and tactile perception on the guidance of textual description. Multimodal encoding consists of two stages: the visual-tactile feature fusion encoding for robot grasp and object spatial perception, and the textual normalization and encoding, followed by spatial perception-semantic feature fusion. We conceive to introduce tactile perception in the pre-contact phase. A visual-tactile fusion method is proposed to combine single-view point cloud with tactile array of pre-contact, partitioning the object surface into multiple contact patches. The vision transformer architecture is employed to gain a spatial representation of the object surface, encoding globally spatial features and implicitly assisting in evaluating the feasibility of grasps across regions by shape learning. To improve the inference effectiveness of the large language model under varying task context, we introduce specifications for textual standardization and migratory comprehension for unknown concepts. The language encoder and principal component analysis are used to encode the given standardized text that follows the above textual generation paradigm. A spatial-semantic feature fusion method is then proposed based on window shift and cross-attention to realize the alignment between task context and object spatial features, ulteriorly preference-based attention on graspable regions. Finally, we present a grasp parameter prediction module based on diffusion model specialized for high-dimensional conditions and generalized hand proprioceptive space, which generates grasp by predicting noise. Experimental results demonstrate that the proposed method outperforms baseline methods requiring the complete object shape, since the mean average precision metrics reaches 72.13%, with an improvement of 1.65%. Each module exhibits performance improvement over conventional methods. Ablation study indicates that the introduction of tactile and text modality improves the metrics by over 3% and 14%. 
The single-shot success rate for the predicted grasps exceeds 65% on real-world experiments, underscoring the reliability of the proposed system.</div></div>\",\"PeriodicalId\":21452,\"journal\":{\"name\":\"Robotics and Computer-integrated Manufacturing\",\"volume\":\"98 \",\"pages\":\"Article 103152\"},\"PeriodicalIF\":11.4000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Robotics and Computer-integrated Manufacturing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0736584525002066\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Robotics and Computer-integrated Manufacturing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0736584525002066","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

A task-oriented grasp planned before manipulation is indispensable for reliable robotic manipulation. Existing task-oriented grasp generation methods typically rely on visual information about the target and are deployed on simple parallel-jaw grippers, so they often struggle with visual degradation and offer limited grasp reliability and flexibility. In this paper, we propose a task-oriented grasp generation system that integrates external visual and tactile perception under the guidance of textual descriptions. Multimodal encoding consists of two stages: visual-tactile feature fusion encoding for robot grasping and object spatial perception, and textual normalization and encoding, followed by spatial perception-semantic feature fusion. We introduce tactile perception in the pre-contact phase: a visual-tactile fusion method combines a single-view point cloud with a pre-contact tactile array, partitioning the object surface into multiple contact patches. A vision transformer architecture is employed to obtain a spatial representation of the object surface, encoding global spatial features and, through shape learning, implicitly assisting in evaluating the feasibility of grasps across regions. To improve the inference effectiveness of the large language model under varying task contexts, we introduce specifications for textual standardization and for migratory comprehension of unknown concepts. A language encoder and principal component analysis are used to encode standardized text that follows this textual generation paradigm. A spatial-semantic feature fusion method based on window shift and cross-attention is then proposed to align the task context with the object's spatial features and to focus preference-based attention on graspable regions. Finally, we present a grasp parameter prediction module based on a diffusion model tailored to high-dimensional conditions and a generalized hand proprioceptive space, which generates grasps by predicting noise. Experimental results demonstrate that the proposed method outperforms baseline methods that require the complete object shape, reaching a mean average precision of 72.13%, an improvement of 1.65%. Each module outperforms its conventional counterpart. An ablation study indicates that introducing the tactile and text modalities improves the metric by over 3% and 14%, respectively. The single-shot success rate of the predicted grasps exceeds 65% in real-world experiments, underscoring the reliability of the proposed system.
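The abstract outlines two components that are concrete enough to sketch. First, the spatial-semantic fusion step: the sketch below shows how a single cross-attention layer can let an encoded, standardized task description attend over per-patch features of the object surface, yielding a task-conditioned preference over contact patches. This is a minimal PyTorch illustration under assumed names and dimensions (`SpatialSemanticFusion`, `d_model=256`, 64 patches); it omits the window-shift mechanism and is not the authors' implementation.

```python
# Minimal sketch (assumed names/dimensions, not the paper's code):
# cross-attention from task-text tokens to object surface-patch tokens.
import torch
import torch.nn as nn

class SpatialSemanticFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, patch_tokens):
        # text_tokens:  (B, T, d) encoded standardized task description
        # patch_tokens: (B, P, d) ViT features of pre-contact surface patches
        fused, weights = self.attn(query=text_tokens,
                                   key=patch_tokens,
                                   value=patch_tokens)
        # weights (B, T, P) act as a task-conditioned preference over patches;
        # the fused tokens condition the downstream grasp generator.
        return self.norm(fused + text_tokens), weights

fusion = SpatialSemanticFusion()
text = torch.randn(2, 8, 256)      # 8 text tokens after PCA projection (assumed)
patches = torch.randn(2, 64, 256)  # 64 surface patches (assumed)
cond_tokens, preference = fusion(text, patches)
```

Second, grasp generation by noise prediction: the sketch below is a generic conditional DDPM-style loop in which a small network predicts the noise added to a grasp-parameter vector, and a reverse pass denoises from Gaussian noise conditioned on the fused feature. The grasp dimensionality, noise schedule, and network are illustrative assumptions rather than the paper's settings.

```python
# Generic conditional diffusion sketch for grasp-parameter generation
# (assumed dimensions and schedule; not the paper's configuration).
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    def __init__(self, grasp_dim=25, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(grasp_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, grasp_dim))

    def forward(self, g_t, t, cond):
        # g_t: noisy grasp parameters, t: normalized timestep, cond: fused feature
        return self.net(torch.cat([g_t, t, cond], dim=-1))

@torch.no_grad()
def sample_grasp(model, cond, steps=50, grasp_dim=25):
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    g = torch.randn(cond.shape[0], grasp_dim)    # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), i / steps)
        eps = model(g, t, cond)                  # predicted noise
        g = (g - betas[i] / torch.sqrt(1.0 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            g = g + torch.sqrt(betas[i]) * torch.randn_like(g)
    return g  # e.g. wrist pose plus hand joint parameters

grasp = sample_grasp(NoisePredictor(), cond=torch.randn(2, 256))
```

In both sketches the conditioning tensor stands in for the fused vision-tactile-language feature produced by the encoding stages described above.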
Source journal
Robotics and Computer-integrated Manufacturing
Category: Engineering & Technology – Engineering, Manufacturing
CiteScore: 24.10
Self-citation rate: 13.50%
Articles per year: 160
Review time: 50 days
Aims and scope: The journal, Robotics and Computer-Integrated Manufacturing, focuses on sharing research applications that contribute to the development of new or enhanced robotics, manufacturing technologies, and innovative manufacturing strategies that are relevant to industry. Papers that combine theory and experimental validation are preferred, while review papers on current robotics and manufacturing issues are also considered. However, papers on traditional machining processes, modeling and simulation, supply chain management, and resource optimization are generally not within the scope of the journal, as there are more appropriate journals for these topics. Similarly, papers that are overly theoretical or mathematical will be directed to other suitable journals. The journal welcomes original papers in areas such as industrial robotics, human-robot collaboration in manufacturing, cloud-based manufacturing, cyber-physical production systems, big data analytics in manufacturing, smart mechatronics, machine learning, adaptive and sustainable manufacturing, and other fields involving unique manufacturing technologies.