Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions

arXiv - CS - Artificial Intelligence | Pub Date: 2024-08-08 | DOI: arxiv-2408.04168

Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li


Citations: 0


Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions

This paper considers a scenario in city navigation: an AI agent is given language descriptions of the goal location relative to some well-known landmarks; by observing only the surrounding scene, including recognizing landmarks and road network connections, the agent has to decide how to navigate to the goal location without instructions. This problem is very challenging because it requires the agent to establish its own position and acquire a spatial representation of a complex urban environment, where landmarks are often not visible. In the absence of navigation instructions, such abilities are vital for the agent to make high-quality decisions in long-range city navigation. Given the emergent reasoning ability of large language models (LLMs), a tempting baseline is to prompt LLMs to "react" to each observation and make decisions accordingly. However, this baseline performs very poorly: the agent often revisits the same locations and makes short-sighted, inconsistent decisions. To address these issues, this paper introduces a novel agentic workflow characterized by its abilities to perceive, reflect, and plan. Specifically, we find that LLaVA-7B can be fine-tuned to perceive the direction and distance of landmarks with sufficient accuracy for city navigation. Moreover, reflection is achieved through a memory mechanism, in which past experiences are stored and can be retrieved with the current perception for effective decision argumentation. Planning uses the reflection results to produce long-term plans, which avoid short-sighted decisions in long-range navigation. We show that the designed workflow significantly improves the navigation ability of the LLM agent compared with state-of-the-art baselines.
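As a rough illustration of the perceive, reflect, and plan loop described in the abstract, the following toy sketch may help. It is not the authors' implementation: the grid world, the `perceive` stub (which stands in for the fine-tuned LLaVA-7B perception model), the visit-count `Memory`, and the greedy one-step `plan` are all hypothetical simplifications introduced here. In particular, planning is reduced to choosing the next move, whereas the paper describes long-term plans.

```python
# Toy sketch only: every name and rule below is an assumption, not the paper's method.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def perceive(agent_xy, landmark_xy):
    """Stub for the perception model: landmark offset (dx, dy) as seen from the agent."""
    return (landmark_xy[0] - agent_xy[0], landmark_xy[1] - agent_xy[1])


class Memory:
    """Stores past perceptions so they can be retrieved by the current perception."""

    def __init__(self):
        self.seen = {}  # perception -> number of times it was observed

    def store(self, perception):
        self.seen[perception] = self.seen.get(perception, 0) + 1

    def visits(self, perception):
        return self.seen.get(perception, 0)


def available_moves(agent_xy, blocked, size):
    """Observed road connections: moves that stay on the grid and avoid blocked cells."""
    options = []
    for name, (dx, dy) in MOVES.items():
        nxt = (agent_xy[0] + dx, agent_xy[1] + dy)
        if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in blocked:
            options.append(name)
    return options


def plan(perception, goal_from_landmark, options, memory):
    """Pick the move whose outcome is least visited, then closest to the goal."""
    def score(move):
        dx, dy = MOVES[move]
        # Landmark offset the agent expects to perceive after taking this move.
        nxt = (perception[0] - dx, perception[1] - dy)
        remaining = (abs(nxt[0] + goal_from_landmark[0])
                     + abs(nxt[1] + goal_from_landmark[1]))
        return (memory.visits(nxt), remaining)  # reflection first, progress second
    return min(options, key=score)


def navigate(start, landmark, goal_from_landmark, blocked, size=5, max_steps=50):
    """Run the loop until the goal is reached or the step budget runs out."""
    memory = Memory()
    pos, path = start, [start]
    for _ in range(max_steps):
        perception = perceive(pos, landmark)   # perceive
        memory.store(perception)               # remember for later reflection
        goal_vec = (perception[0] + goal_from_landmark[0],
                    perception[1] + goal_from_landmark[1])
        if goal_vec == (0, 0):                 # goal offset is zero: arrived
            return path
        options = available_moves(pos, blocked, size)
        move = plan(perception, goal_from_landmark, options, memory)
        dx, dy = MOVES[move]
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path
```

For example, `navigate((0, 0), (2, 2), (1, 0), {(1, 0), (1, 1), (1, 2)})` describes a goal one block east of a landmark at (2, 2), with a wall of blocked intersections at x = 1. In this toy setting the agent never needs its absolute position: the goal is located by adding the language-described offset to the perceived landmark offset, and the memory lookup discourages returning to intersections it has already seen.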
