Bridging vision and touch: advancing robotic interaction prediction with self-supervised multimodal learning.

Impact factor: 2.9 | JCR quartile: Q2 | Category: Robotics

Frontiers in Robotics and AI. Pub Date: 2024-09-30. eCollection Date: 2024-01-01. DOI: 10.3389/frobt.2024.1407519

Luchen Li, Thomas George Thuruthel

Volume 11, article 1407519. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11472251/pdf/

Citations: 0

Abstract

Predicting the consequences of an agent's actions on its environment is a pivotal challenge in robotic learning and plays a key role in developing higher cognitive skills for intelligent robots. While current methods have predominantly relied on vision and motion data to generate predicted videos, complex physical interactions such as contact-rich manipulation or highly dynamic tasks require more comprehensive sensory perception. In this work, we investigate the interdependence between vision and tactile sensation in dynamic robotic interaction. We introduce a multi-modal fusion mechanism into an action-conditioned video prediction model to forecast future scenes, enriching the single-modality prototype with a compressed latent representation of multiple sensory inputs. To realize the interactive setting, we built a robotic interaction system equipped with both web cameras and vision-based tactile sensors to collect a dataset of vision-tactile sequences and the corresponding robot action data. Finally, through a series of qualitative and quantitative comparative studies of different prediction architectures and tasks, we analyze the cross-modality influence between vision, touch, and action, revealing an asymmetrical impact between the sensory modalities in how they contribute to interpreting environmental information. This opens possibilities for more adaptive and efficient robotic control in complex environments, with implications for dexterous manipulation and human-robot interaction.
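The fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the class name `FusionPredictor`, the layer sizes, the bias-free linear layers, and the concatenate-then-project fusion are all assumptions made for illustration, and the random weights stand in for parameters that would be trained. It only shows the data flow the abstract describes: each modality is encoded, the two latents are compressed into one shared latent, the robot action conditions the step to the next latent, and both modalities are decoded from it. In a self-supervised setup, the training target for each decoded output would simply be the next recorded frame of that modality.

```python
import random


def linear(weights, x):
    """Apply a bias-free dense layer: returns weights @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init(rows, cols, rng):
    """Small random weight matrix; a placeholder for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FusionPredictor:
    """Hypothetical action-conditioned predictor over a fused vision-tactile latent."""

    def __init__(self, vision_dim, tactile_dim, action_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.enc_vision = init(latent_dim, vision_dim, rng)
        self.enc_tactile = init(latent_dim, tactile_dim, rng)
        # Fusion compresses the two modality latents into one shared latent.
        self.fuse = init(latent_dim, 2 * latent_dim, rng)
        # Dynamics maps (fused latent, action) to the next fused latent.
        self.dynamics = init(latent_dim, latent_dim + action_dim, rng)
        self.dec_vision = init(vision_dim, latent_dim, rng)
        self.dec_tactile = init(tactile_dim, latent_dim, rng)

    def predict(self, vision, tactile, action):
        """Return predicted next (vision, tactile) feature vectors."""
        z_v = linear(self.enc_vision, vision)
        z_t = linear(self.enc_tactile, tactile)
        z = linear(self.fuse, z_v + z_t)            # "+" concatenates the lists
        z_next = linear(self.dynamics, z + action)  # condition on the action
        return linear(self.dec_vision, z_next), linear(self.dec_tactile, z_next)


if __name__ == "__main__":
    model = FusionPredictor(vision_dim=16, tactile_dim=8, action_dim=3, latent_dim=4)
    next_vision, next_tactile = model.predict([0.5] * 16, [0.2] * 8, [0.1, 0.0, -0.1])
    print(len(next_vision), len(next_tactile))
```

Dropping one encoder from the fused latent (for example, feeding zeros for the tactile input) is one simple way such a sketch could be used to probe how much each modality contributes to the prediction of the other.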

Source journal

Frontiers in Robotics and AI (category: Robotics)

CiteScore: 6.50
Self-citation rate: 5.90%
Articles published: 355
Review time: 14 weeks

About the journal: Frontiers in Robotics and AI publishes rigorously peer-reviewed research covering all theory and applications of robotics, technology, and artificial intelligence, from biomedical to space robotics.