DreamArrangement: Learning Language-Conditioned Robotic Rearrangement of Objects via Denoising Diffusion and VLM Planner
Authors: Wenkai Chen; Changming Xiao; Ge Gao; Fuchun Sun; Changshui Zhang; Jianwei Zhang
Journal: IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 11, pp. 8675-8688 (JCR Q1, Automation & Control Systems; impact factor 8.7)
Published: 2025-09-24
DOI: 10.1109/TSMC.2025.3611698
URL: https://ieeexplore.ieee.org/document/11176993/
The ability of robotic systems to rearrange objects based on human instructions is a critical step toward embodied intelligence. Recently, diffusion-based learning has significantly advanced data generation, while prompt-based learning has proven effective in formulating robot manipulation strategies. However, prior solutions for robotic rearrangement have overlooked the importance of integrating human preferences and optimizing for rearrangement efficiency. Additionally, traditional prompt-based approaches struggle with complex, semantically meaningful rearrangement tasks in which no target states are predefined for the objects. To address these challenges, our work first introduces a comprehensive two-dimensional (2-D) tabletop rearrangement dataset, using a physical simulator to capture interobject relationships and semantic configurations. Then, we present DreamArrangement, a novel language-conditioned object rearrangement scheme consisting of two primary processes: a transformer-based multimodal denoising diffusion model envisages the desired arrangement of objects, and a vision–language foundation model derives actionable policies from text together with initial and target visual information. In particular, we introduce an efficiency-oriented learning strategy to minimize the average motion distance of objects. Given few-shot instruction examples, the policy learned from our synthetic dataset can be transferred to the real world without extra human intervention. Extensive simulations validate DreamArrangement’s superior rearrangement quality and efficiency. Moreover, real-world robotic experiments confirm that our method can execute a range of challenging, language-conditioned, long-horizon tasks with a single model. The demonstration video can be found at https://youtu.be/fq25-DjrbQE
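The abstract names "average motion distance of objects" as the quantity the efficiency-oriented strategy minimizes, but it does not define it. As a rough illustration only, the sketch below assumes the metric is the mean Euclidean displacement of each object between its initial and target 2-D position. The function name `average_motion_distance`, the dictionary layout, and the example objects are hypothetical and are not taken from the paper.

```python
import math


def average_motion_distance(initial, target):
    """Mean Euclidean displacement over objects (assumed metric, not the paper's code).

    `initial` and `target` map an object name to its (x, y) tabletop position.
    """
    if initial.keys() != target.keys():
        raise ValueError("initial and target must describe the same objects")
    if not initial:
        return 0.0
    # Sum the straight-line distance each object has to travel.
    total = sum(math.dist(initial[name], target[name]) for name in initial)
    return total / len(initial)


# Hypothetical scene: the cup moves, the plate stays where it is.
initial = {"cup": (0.0, 0.0), "plate": (1.0, 1.0)}
target = {"cup": (3.0, 4.0), "plate": (1.0, 1.0)}
print(average_motion_distance(initial, target))
```

Under this assumption, a target arrangement that satisfies the instruction while leaving already well-placed objects untouched scores lower than one that moves every object.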
About the journal:
The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.