Bridging vision and touch: advancing robotic interaction prediction with self-supervised multimodal learning.

Impact factor: 2.9 | JCR: Q2 (Robotics)
Frontiers in Robotics and AI | Pub Date: 2024-09-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/frobt.2024.1407519
Luchen Li, Thomas George Thuruthel
Volume 11, article 1407519. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11472251/pdf/
Citations: 0

Abstract

Predicting the consequences of an agent's actions on its environment is a pivotal challenge in robotic learning and plays a key role in developing higher cognitive skills for intelligent robots. While current methods have predominantly relied on vision and motion data to generate predicted videos, more comprehensive sensory perception is required for complex physical interactions such as contact-rich manipulation or highly dynamic tasks. In this work, we investigate the interdependence between vision and tactile sensation in dynamic robotic interaction. A multi-modal fusion mechanism is introduced into the action-conditioned video prediction model to forecast future scenes, enriching the single-modality prototype with a compressed latent representation of multiple sensory inputs. Additionally, to realize the interactive setting, we built a robotic interaction system equipped with both web cameras and vision-based tactile sensors to collect a dataset of vision-tactile sequences and the corresponding robot action data. Finally, through a series of qualitative and quantitative comparative studies of different prediction architectures and tasks, we analyze the cross-modality influence between vision, touch, and action, revealing an asymmetrical impact between the two sensations in how they contribute to interpreting environmental information. This opens possibilities for more adaptive and efficient robotic control in complex environments, with implications for dexterous manipulation and human-robot interaction.
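
The fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the class name `FusionPredictor`, the layer sizes, the use of flattened frames, and the untrained random linear layers are all assumptions chosen only to show how vision and tactile inputs might be compressed into one latent vector that is then conditioned on the robot action to predict the next visual frame.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Return random weights and a zero bias (untrained, illustrative only)."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


class FusionPredictor:
    """Hypothetical toy predictor: vision + tactile -> shared latent -> next frame."""

    def __init__(self, vision_dim, tactile_dim, action_dim, latent_dim=32):
        self.enc_v = linear(vision_dim, latent_dim)      # vision encoder
        self.enc_t = linear(tactile_dim, latent_dim)     # tactile encoder
        self.fuse = linear(2 * latent_dim, latent_dim)   # compress both into one latent
        self.dyn = linear(latent_dim + action_dim, latent_dim)  # action-conditioned step
        self.dec_v = linear(latent_dim, vision_dim)      # decode next visual frame

    @staticmethod
    def _apply(params, x):
        w, b = params
        return np.tanh(x @ w + b)

    def encode(self, vision, tactile):
        """Fuse the two modalities into a single compressed latent vector."""
        z_v = self._apply(self.enc_v, vision)
        z_t = self._apply(self.enc_t, tactile)
        return self._apply(self.fuse, np.concatenate([z_v, z_t], axis=-1))

    def step(self, vision, tactile, action):
        """Predict the next flattened visual frame given the current inputs and action."""
        z = self.encode(vision, tactile)
        z_next = self._apply(self.dyn, np.concatenate([z, action], axis=-1))
        w, b = self.dec_v
        return z_next @ w + b  # linear decoder, no output non-linearity


# Usage with assumed sizes: 16x16 vision frames, 8x8 tactile frames, 3-D actions.
model = FusionPredictor(vision_dim=256, tactile_dim=64, action_dim=3)
vision = rng.random((4, 256))
tactile = rng.random((4, 64))
action = rng.random((4, 3))
next_frame = model.step(vision, tactile, action)
print(next_frame.shape)
```

In a real system the linear layers would be replaced by trained convolutional or recurrent networks; the sketch only fixes the data flow, in which both sensory streams pass through a shared bottleneck before the action is applied.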

Source journal: Frontiers in Robotics and AI
CiteScore: 6.50
Self-citation rate: 5.90%
Articles published: 355
Review time: 14 weeks
Journal description: Frontiers in Robotics and AI publishes rigorously peer-reviewed research covering all theory and applications of robotics, technology, and artificial intelligence, from biomedical to space robotics.