Shaochen Wang, Zhangli Zhou, Bin Li, Zhijun Li, Zhen Kan
Robotica, vol. 45, no. 22 (JCR Q3, ROBOTICS). Published 2023-11-13. DOI: 10.1017/s0263574723001510
Multi-modal interaction with transformers: bridging robots and human with natural language
Abstract: The language-guided visual robotic grasping task focuses on enabling robots to grasp objects based on human language instructions. However, real-world human-robot collaboration tasks often involve situations with ambiguous language instructions and complex scenarios. These challenges arise in the understanding of linguistic queries, discrimination of key concepts in visual and language information, and generation of executable grasping configurations for the robot’s end-effector. To overcome these challenges, we propose a novel multi-modal transformer-based framework in this study, which assists robots in localizing spatial interactions of objects using text queries and visual sensing. This framework facilitates object grasping in accordance with human instructions. Our developed framework consists of two main components. First, a visual-linguistic transformer encoder is employed to model multi-modal interactions for objects referred to in the text. Second, the framework performs joint spatial localization and grasping. Extensive ablation studies have been conducted on multiple datasets to evaluate the advantages of each component in our model. Additionally, physical experiments have been performed with natural language-driven human-robot interactions on a physical robot to validate the practicality of our approach.
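The abstract's two components, a visual-linguistic transformer encoder followed by joint localization and grasp prediction, can be illustrated with a minimal sketch. This is not the authors' implementation; the class name, token layout (a learned fusion token prepended to visual patch tokens and text tokens), head dimensions, and all hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualLinguisticEncoder(nn.Module):
    """Hypothetical sketch of a multi-modal transformer for
    language-guided grasping: visual and text tokens are fused by a
    shared encoder, then a pooled token drives two prediction heads."""

    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Learned fusion token, analogous to a [CLS] token (assumption).
        self.fuse_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Joint heads: object localization and grasp configuration.
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.grasp_head = nn.Linear(d_model, 4)  # (x, y, theta, width)

    def forward(self, visual_tokens, text_tokens):
        b = visual_tokens.size(0)
        fuse = self.fuse_token.expand(b, -1, -1)
        # Concatenate modalities so self-attention models cross-modal
        # interactions between image patches and instruction tokens.
        tokens = torch.cat([fuse, visual_tokens, text_tokens], dim=1)
        fused = self.encoder(tokens)[:, 0]       # read out the fusion token
        return self.box_head(fused), self.grasp_head(fused)

model = VisualLinguisticEncoder()
vis = torch.randn(2, 49, 128)   # e.g. 7x7 embedded image patches
txt = torch.randn(2, 12, 128)   # embedded instruction tokens
box, grasp = model(vis, txt)
print(box.shape, grasp.shape)   # torch.Size([2, 4]) torch.Size([2, 4])
```

Concatenating both modalities into one token sequence lets every attention layer relate words in the query (e.g. "the cup on the left") to image regions, which is the cross-modal grounding the abstract describes.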
Journal introduction:
Robotica is a forum for the multidisciplinary subject of robotics and encourages developments, applications and research in this important field of automation and robotics with regard to industry, health, education and economic and social aspects of relevance. Coverage includes activities in hostile environments, applications in the service and manufacturing industries, biological robotics, dynamics and kinematics involved in robot design and uses, on-line robots, robot task planning, rehabilitation robotics, sensory perception, software in the widest sense, particularly in respect of programming languages and links with CAD/CAM systems, telerobotics and various other areas. In addition, interest is focused on various Artificial Intelligence topics of theoretical and practical interest.