VTLG: A vision-tactile-language grasp generation method oriented towards task
Tong Li, Chengshun Yu, Yuhang Yan, Di Song, Yuxin Shuai, Yifan Wang, Gang Chen
Robotics and Computer-Integrated Manufacturing, Volume 98, October 2025, Article 103152
DOI: 10.1016/j.rcim.2025.103152
Abstract
Task-oriented grasp generation is indispensable for achieving reliable robotic manipulation. Existing task-oriented grasp generation methods typically rely on visual information of the target and are deployed on simple parallel-jaw grippers, which often struggle with visual degradation and offer limited grasp reliability and flexibility. In this paper, we propose a task-oriented grasp generation system that integrates external visual and tactile perception under the guidance of textual descriptions. Multimodal encoding consists of two stages: visual-tactile feature fusion encoding for robot grasp and object spatial perception, and textual normalization and encoding followed by spatial perception-semantic feature fusion. We introduce tactile perception in the pre-contact phase: a visual-tactile fusion method combines the single-view point cloud with a pre-contact tactile array, partitioning the object surface into multiple contact patches. A vision transformer architecture is employed to obtain a spatial representation of the object surface, encoding global spatial features and, through shape learning, implicitly assisting in evaluating the feasibility of grasps across regions. To improve the inference effectiveness of the large language model under varying task contexts, we introduce specifications for textual standardization and migratory comprehension of unknown concepts. A language encoder and principal component analysis are used to encode standardized text that follows this textual generation paradigm. A spatial-semantic feature fusion method based on window shift and cross-attention is then proposed to align the task context with object spatial features and to further focus attention on graspable regions. Finally, we present a grasp parameter prediction module based on a diffusion model specialized for high-dimensional conditions and a generalized hand proprioceptive space, which generates grasps by predicting noise. Experimental results demonstrate that the proposed method outperforms baseline methods that require the complete object shape, reaching a mean average precision of 72.13%, an improvement of 1.65%. Each module exhibits a performance improvement over conventional methods. An ablation study indicates that introducing the tactile and text modalities improves the metric by over 3% and 14%, respectively. The single-shot success rate of the predicted grasps exceeds 65% in real-world experiments, underscoring the reliability of the proposed system.
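The abstract does not give implementation details, but two of its key ideas can be illustrated with a minimal sketch: cross-attention that lets the encoded task text attend over visual-tactile patch features, and a diffusion-style module that generates grasps by predicting the noise added to grasp parameters. The PyTorch code below is an assumption-laden illustration only, not the authors' VTLG implementation; all module names, tensor shapes, and the 25-dimensional grasp parameterization are hypothetical, and the paper's window-shift mechanism and ViT encoder are omitted.

```python
# Illustrative sketch only (not the paper's code): cross-attention fusion of
# object-surface patch tokens with task-text tokens, conditioning a
# diffusion-style noise predictor for grasp parameters. Shapes are assumptions.
import torch
import torch.nn as nn

class SpatialSemanticFusion(nn.Module):
    """Cross-attention: text tokens query the visual-tactile patch tokens."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, patch_tokens):
        # text_tokens: (B, T, D) encoded task description
        # patch_tokens: (B, P, D) patch features from a visual-tactile encoder
        fused, _ = self.attn(query=text_tokens, key=patch_tokens, value=patch_tokens)
        # Residual + norm, then pool to a single task-conditioned context vector
        return self.norm(fused + text_tokens).mean(dim=1)  # (B, D)

class GraspNoisePredictor(nn.Module):
    """Predicts the noise added to grasp parameters, given the fused context."""
    def __init__(self, grasp_dim=25, dim=256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(grasp_dim + 2 * dim, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, grasp_dim),
        )

    def forward(self, noisy_grasp, t, context):
        # noisy_grasp: (B, grasp_dim) hand pose + joint configuration with noise
        # t: (B, 1) diffusion timestep, context: (B, D) fused spatial-semantic feature
        h = torch.cat([noisy_grasp, self.time_embed(t), context], dim=-1)
        return self.net(h)  # predicted noise, removed step by step at sampling time

# Toy forward pass with random tensors
B, T, P, D, G = 2, 8, 64, 256, 25
fusion, denoiser = SpatialSemanticFusion(D), GraspNoisePredictor(G, D)
ctx = fusion(torch.randn(B, T, D), torch.randn(B, P, D))
eps = denoiser(torch.randn(B, G), torch.rand(B, 1), ctx)
print(eps.shape)  # torch.Size([2, 25])
```

In this sketch the text acts as the query so that attention weights concentrate on graspable surface patches relevant to the task; at inference, the noise predictor would be applied iteratively inside a standard diffusion sampling loop to denoise a random grasp parameter vector.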
About the journal
The journal, Robotics and Computer-Integrated Manufacturing, focuses on sharing research applications that contribute to the development of new or enhanced robotics, manufacturing technologies, and innovative manufacturing strategies that are relevant to industry. Papers that combine theory and experimental validation are preferred, while review papers on current robotics and manufacturing issues are also considered. However, papers on traditional machining processes, modeling and simulation, supply chain management, and resource optimization are generally not within the scope of the journal, as there are more appropriate journals for these topics. Similarly, papers that are overly theoretical or mathematical will be directed to other suitable journals. The journal welcomes original papers in areas such as industrial robotics, human-robot collaboration in manufacturing, cloud-based manufacturing, cyber-physical production systems, big data analytics in manufacturing, smart mechatronics, machine learning, adaptive and sustainable manufacturing, and other fields involving unique manufacturing technologies.