Multi-modal interaction with transformers: bridging robots and human with natural language

IF 2.7 · Zone 4, Computer Science · JCR Q3 ROBOTICS
Robotica Pub Date : 2023-11-13 DOI:10.1017/s0263574723001510
Shaochen Wang, Zhangli Zhou, Bin Li, Zhijun Li, Zhen Kan
Citations: 0

Abstract

The language-guided visual robotic grasping task focuses on enabling robots to grasp objects based on human language instructions. However, real-world human-robot collaboration tasks often involve situations with ambiguous language instructions and complex scenarios. These challenges arise in the understanding of linguistic queries, discrimination of key concepts in visual and language information, and generation of executable grasping configurations for the robot’s end-effector. To overcome these challenges, we propose a novel multi-modal transformer-based framework in this study, which assists robots in localizing spatial interactions of objects using text queries and visual sensing. This framework facilitates object grasping in accordance with human instructions. Our developed framework consists of two main components. First, a visual-linguistic transformer encoder is employed to model multi-modal interactions for objects referred to in the text. Second, the framework performs joint spatial localization and grasping. Extensive ablation studies have been conducted on multiple datasets to evaluate the advantages of each component in our model. Additionally, physical experiments have been performed with natural language-driven human-robot interactions on a physical robot to validate the practicality of our approach.
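
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so everything below is an assumption made for illustration, including the single attention head, the random weights, the token counts, the mean pooling, and the hypothetical output layouts (a 4-value box and a 5-value grasp). It only shows the general idea of letting visual and text tokens attend to each other in one sequence and then reading out a localization and a grasp from the fused representation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters or embeddings."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse_and_predict(visual_tokens, text_tokens, dim, rng):
    """Attend jointly over both modalities, then read out two heads."""
    tokens = visual_tokens + text_tokens  # one joint sequence of tokens
    wq, wk, wv = (random_matrix(dim, dim, rng) for _ in range(3))
    fused, weights = self_attention(tokens, wq, wk, wv)
    # Mean-pool the fused tokens into a single feature vector.
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    # Hypothetical output layouts, assumed for illustration only.
    box_head = random_matrix(dim, 4, rng)    # e.g. (cx, cy, w, h)
    grasp_head = random_matrix(dim, 5, rng)  # e.g. (x, y, angle, width, quality)
    box = matmul([pooled], box_head)[0]
    grasp = matmul([pooled], grasp_head)[0]
    return box, grasp, weights


rng = random.Random(0)
dim = 8
visual_tokens = random_matrix(6, dim, rng)  # stand-in for 6 image patch embeddings
text_tokens = random_matrix(3, dim, rng)    # stand-in for 3 word embeddings
box, grasp, weights = fuse_and_predict(visual_tokens, text_tokens, dim, rng)
print(len(box), len(grasp), len(weights), len(weights[0]))
```

Each row of `weights` shows how much one token, visual or textual, attends to every other token in the joint sequence. A trained model would learn these projections and would typically stack several multi-head layers rather than use one random layer as here.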
Source journal: Robotica
Category: Engineering Technology, Robotics
CiteScore: 4.50
Self-citation rate: 22.20%
Articles published: 181
Review time: 9.9 months
Journal description: Robotica is a forum for the multidisciplinary subject of robotics and encourages developments, applications and research in this important field of automation and robotics with regard to industry, health, education and economic and social aspects of relevance. Coverage includes activities in hostile environments, applications in the service and manufacturing industries, biological robotics, dynamics and kinematics involved in robot design and uses, on-line robots, robot task planning, rehabilitation robotics, sensory perception, software in the widest sense, particularly in respect of programming languages and links with CAD/CAM systems, telerobotics and various other areas. In addition, interest is focused on various Artificial Intelligence topics of theoretical and practical interest.