Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation
Kejia Chen, Zheng Shen, Yue Zhang, Lingyun Chen, Fan Wu, Zhenshan Bing, Sami Haddadin, Alois Knoll
arXiv:2409.11863 · arXiv - CS - Robotics · 2024-09-18
Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots.

In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.
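
To make the bootstrapped, modality-by-modality refinement idea concrete, the following is a minimal sketch of how such a pipeline could be organized. All names here (Stage, Demo, bootstrap_plan, the query_llm callable) and the prompt wording are hypothetical illustrations; the abstract does not specify the paper's actual data structures, prompt formats, or modality encodings.

```python
# Minimal sketch of a bootstrapped multi-modal plan builder, assuming
# hypothetical data structures (Stage, Demo) and a caller-supplied
# query_llm function; everything below is illustrative, not the
# paper's implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Stage:
    """One segment of a demonstrated manipulation task."""
    name: str                 # e.g. "insert peg"
    visual_summary: str       # what is observed in this segment
    tactile_summary: str = "" # e.g. "fingertip contact detected"
    wrench_summary: str = ""  # e.g. "downward force rises, torque spike"


@dataclass
class Demo:
    task: str
    stages: List[Stage] = field(default_factory=list)


def bootstrap_plan(demo: Demo, query_llm: Callable[[str], str]) -> str:
    """Sequentially fold each modality into an evolving task plan.

    Pass 1 drafts a plan from visual summaries alone; later passes ask
    the LLM to revise that plan with tactile and force-torque cues,
    one modality at a time rather than all at once.
    """
    plan = query_llm(
        f"Task: {demo.task}\n"
        "Draft a step-by-step plan from these visual observations:\n"
        + "\n".join(f"- {s.name}: {s.visual_summary}" for s in demo.stages)
    )
    for modality, getter in [
        ("tactile", lambda s: s.tactile_summary),
        ("force-torque", lambda s: s.wrench_summary),
    ]:
        cues = "\n".join(
            f"- {s.name}: {getter(s)}" for s in demo.stages if getter(s)
        )
        if cues:
            plan = query_llm(
                f"Current plan:\n{plan}\n\n"
                f"Revise it using these {modality} cues, adding contact "
                f"conditions or force parameters where relevant:\n{cues}"
            )
    return plan  # later placed in context when planning a new configuration
```

The design choice mirrored here is the sequential integration the abstract describes: the plan is first drafted from vision, then revised with one additional sensing modality at a time, and the resulting annotated plan can serve as an in-context reference when the LLM plans a new task configuration.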