DreamArrangement: Learning Language-Conditioned Robotic Rearrangement of Objects via Denoising Diffusion and VLM Planner

IF 8.7; Zone 1 (Computer Science); Q1, AUTOMATION & CONTROL SYSTEMS

IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 11, pp. 8675-8688. Pub Date: 2025-09-24. DOI: 10.1109/TSMC.2025.3611698

Wenkai Chen;Changming Xiao;Ge Gao;Fuchun Sun;Changshui Zhang;Jianwei Zhang

Citations: 0

Abstract

DreamArrangement: Learning Language-Conditioned Robotic Rearrangement of Objects via Denoising Diffusion and VLM Planner

The capability for robotic systems to rearrange objects based on human instructions represents a critical step toward realizing embodied intelligence. Recently, diffusion-based learning has shown significant advancements in the field of data generation while prompt-based learning has proven effective in formulating robot manipulation strategies. However, prior solutions for robotic rearrangement have overlooked the significance of integrating human preferences and optimizing for rearrangement efficiency. Additionally, traditional prompt-based approaches struggle with complex, semantically meaningful rearrangement tasks without predefined target states for objects. To address these challenges, our work first introduces a comprehensive two dimensional (2-D) tabletop rearrangement dataset, utilizing a physical simulator to capture interobject relationships and semantic configurations. Then, we present DreamArrangement, a novel language-conditioned object rearrangement scheme, consisting of two primary processes: employing a transformer-based multimodal denoising diffusion model to envisage the desired arrangement of objects, and leveraging a vision–language foundational model to derive actionable policies from text, alongside initial and target visual information. In particular, we introduce an efficiency-oriented learning strategy to minimize the average motion distance of objects. Given few-shot instruction examples, the learned policy from our synthetic dataset can be transferred to the real world without extra human intervention. Extensive simulations validate DreamArrangement’s superior rearrangement quality and efficiency. Moreover, real-world robotic experiments confirm that our method can adeptly execute a range of challenging, language-conditioned, and long-horizon tasks with a singular model. The demonstration video can be found at https://youtu.be/fq25-DjrbQE
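The abstract names an efficiency-oriented strategy that minimizes the average motion distance of objects but does not give the formula. As a minimal sketch only, assuming each object is a 2-D tabletop position and that "motion distance" means the straight-line (Euclidean) displacement from its initial to its target position, the quantity could be computed as follows. The function name `average_motion_distance` and the sample coordinates are hypothetical and are not taken from the paper.

```python
import math


def average_motion_distance(initial, target):
    """Mean Euclidean displacement between paired 2-D positions.

    Assumption: initial[i] and target[i] describe the same object.
    This is an illustrative metric, not the paper's stated definition.
    """
    if len(initial) != len(target):
        raise ValueError("initial and target must describe the same objects")
    if not initial:
        return 0.0
    total = sum(math.dist(p, q) for p, q in zip(initial, target))
    return total / len(initial)


# Hypothetical tabletop layout: three objects, one of which stays in place.
initial = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
target = [(3.0, 4.0), (1.0, 1.0), (2.0, 1.0)]
print(average_motion_distance(initial, target))  # displacements 5, 0, 1 -> 2.0
```

Under this assumption, a target arrangement that satisfies the instruction while leaving more objects near their starting positions scores lower, which is the sense in which such a term would favor efficient rearrangements.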

Source journal

IEEE Transactions on Systems, Man, and Cybernetics: Systems
Categories: AUTOMATION & CONTROL SYSTEMS; COMPUTER SCIENCE, CYBERNETICS

CiteScore: 18.50
Self-citation rate: 11.50%
Articles published: 812
Review time: 6 months

About the journal: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.