GR-MG: Leveraging Partially-Annotated Data via Multi-Modal Goal-Conditioned Policy

Impact Factor: 4.6 | CAS Tier 2 (Computer Science) | JCR Q2 (ROBOTICS)
Peiyan Li; Hongtao Wu; Yan Huang; Chilam Cheang; Liang Wang; Tao Kong
{"title":"GR-MG: Leveraging Partially-Annotated Data via Multi-Modal Goal-Conditioned Policy","authors":"Peiyan Li;Hongtao Wu;Yan Huang;Chilam Cheang;Liang Wang;Tao Kong","doi":"10.1109/LRA.2025.3526436","DOIUrl":null,"url":null,"abstract":"The robotics community has consistently aimed to achieve generalizable robot manipulation with flexible natural language instructions. One primary challenge is that obtaining robot trajectories fully annotated with both actions and texts is time-consuming and labor-intensive. However, partially-annotated data, such as human activity videos without action labels and robot trajectories without text labels, are much easier to collect. Can we leverage these data to enhance the generalization capabilities of robots? In this letter, we propose GR-MG, a novel method which supports conditioning on a text instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image or solely on the image when text is not available. During inference, where only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and conditions on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially-annotated data while still using languages to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model which injects task progress information into the generation process. In simulation experiments, GR-MG improves the average number of tasks completed in a row of 5 from 3.35 to 4.04. In real-robot experiments, GR-MG is able to perform 58 different tasks and improves the success rate from 68.7% to 78.1% and 44.4% to 60.6% in simple and generalization settings, respectively. It also outperforms comparing baseline methods in few-shot learning of novel skills.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 2","pages":"1912-1919"},"PeriodicalIF":4.6000,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10829675/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0

Abstract

The robotics community has long aimed to achieve generalizable robot manipulation driven by flexible natural language instructions. One primary challenge is that obtaining robot trajectories fully annotated with both actions and text is time-consuming and labor-intensive. However, partially-annotated data, such as human activity videos without action labels and robot trajectories without text labels, are much easier to collect. Can we leverage these data to enhance the generalization capabilities of robots? In this letter, we propose GR-MG, a novel method that supports conditioning on a text instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image, or solely on the image when text is not available. During inference, when only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and conditions on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially-annotated data while still using language to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model that injects task-progress information into the generation process. In simulation experiments, GR-MG improves the average number of tasks completed in a row (out of 5) from 3.35 to 4.04. In real-robot experiments, GR-MG performs 58 different tasks and improves the success rate from 68.7% to 78.1% in the simple setting and from 44.4% to 60.6% in the generalization setting. It also outperforms baseline methods in few-shot learning of novel skills.
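To make the conditioning scheme concrete, below is a minimal sketch of how a policy can consume a text instruction and a goal image while tolerating missing text labels, which is what lets partially-annotated trajectories enter training. This is an illustrative assumption, not the authors' implementation: the module names, dimensions, batch keys, and the learned no-text placeholder are all hypothetical.

```python
# Illustrative sketch only: module names, dimensions, and the learned
# "missing text" placeholder are assumptions, not the GR-MG codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalGoalPolicy(nn.Module):
    """Predicts actions conditioned on a text instruction and/or a goal image."""

    def __init__(self, text_dim: int = 512, image_dim: int = 512, action_dim: int = 7):
        super().__init__()
        # Learned placeholder substituted when a trajectory has no text label,
        # so one network handles both fully- and partially-annotated data.
        self.no_text_embedding = nn.Parameter(torch.zeros(text_dim))
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, text_emb, goal_image_emb):
        if text_emb is None:  # trajectory without a language annotation
            text_emb = self.no_text_embedding.expand(goal_image_emb.size(0), -1)
        return self.fusion(torch.cat([text_emb, goal_image_emb], dim=-1))

def training_step(policy: MultiModalGoalPolicy, batch: dict) -> torch.Tensor:
    # Hindsight goal sampling: a future frame from the same trajectory serves
    # as the goal image; text conditioning is used only when a label exists.
    goal_emb = batch["future_frame_embedding"]
    text_emb = batch["text_embedding"] if batch["has_text"] else None
    pred_action = policy(text_emb, goal_emb)
    return F.mse_loss(pred_action, batch["action"])
```

At inference time, per the abstract, the goal-image slot would instead be filled by the output of the diffusion-based image-editing model, so the policy always receives both modalities even though the user supplies only text.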
Source Journal
IEEE Robotics and Automation Letters
Subject area: Computer Science - Computer Science Applications
CiteScore: 9.60
Self-citation rate: 15.40%
Articles per year: 1428
Journal description: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.