Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

IF 4.6 · Zone 2 (Computer Science) · JCR Q2 (Robotics)

IEEE Robotics and Automation Letters Pub Date : 2024-12-04 DOI:10.1109/LRA.2024.3511402

Jiawei Wang;Teng Wang;Wenzhe Cai;Lele Xu;Changyin Sun

IEEE Robotics and Automation Letters, vol. 10, no. 1, pp. 612-619. Journal Article. https://ieeexplore.ieee.org/document/10777561/

Citations: 0

Abstract

Vision-and-Language Navigation (VLN) requires an agent to navigate in photo-realistic environments based on language instructions. Existing methods typically employ imitation learning to train agents. However, approaches based on recurrent neural networks suffer from poor generalization, while transformer-based methods are too large in scale for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environment changes. However, without expert trajectories for supervision, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition enables agents to learn value estimation faster, making agents more efficient in learning VLN tasks. We propose the Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN) method, which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-sourced LLM and trains RL agents to complete these sub-instructions sequentially. Based on these interpretable sub-instructions, we introduce the cascaded multi-scale attention (CMA) and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess the completion of sub-instructions. Experimental results show that DILLM-VLN significantly improves over the baseline, demonstrating its potential for practical applications.
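The idea of completing sub-instructions sequentially can be illustrated with a minimal sketch. This is not the authors' implementation: the class `SubInstructionTracker`, the `step_bonus` reward, and the toy `keyword_check` function are all hypothetical names introduced here for illustration. The sub-instructions are assumed to be already decomposed (the LLM step is omitted), and `keyword_check` is only a placeholder for a learned completion discriminator such as the MFD.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class SubInstructionTracker:
    """Tracks which sub-instruction the agent is currently working on."""

    sub_instructions: Sequence[str]
    # Hypothetical stand-in for a completion discriminator.
    is_completed: Callable[[str, dict], bool]
    # Assumed intermediate reward for finishing one sub-instruction.
    step_bonus: float = 1.0
    index: int = 0

    @property
    def current(self) -> Optional[str]:
        """Return the active sub-instruction, or None when all are done."""
        if self.index < len(self.sub_instructions):
            return self.sub_instructions[self.index]
        return None

    def update(self, observation: dict) -> float:
        """Advance by at most one sub-instruction and return the reward."""
        active = self.current
        if active is not None and self.is_completed(active, observation):
            self.index += 1
            return self.step_bonus
        return 0.0


def keyword_check(sub_instruction: str, observation: dict) -> bool:
    """Toy completion test: the observed room name occurs in the text."""
    return observation.get("room", "") in sub_instruction.split()


tracker = SubInstructionTracker(
    ["go to the hallway", "walk to the kitchen"], keyword_check
)
rewards = [
    tracker.update({"room": room})
    for room in ["bedroom", "hallway", "hallway", "kitchen"]
]
```

Under these assumptions, the agent receives a reward each time a sub-instruction is judged complete, rather than only at the end of the episode, which is one way a sparse reward signal could be made denser.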




Source journal

IEEE Robotics and Automation Letters (Computer Science: Computer Science Applications)

CiteScore: 9.60

Self-citation rate: 15.40%

Articles published: 1428

Journal description: The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.