Qingbin Zeng, Qinglong Yang, Shunan Dong, Heming Du, Liang Zheng, Fengli Xu, Yong Li
arXiv - CS - Artificial Intelligence, published 2024-08-08. DOI: arxiv-2408.04168 (https://doi.org/arxiv-2408.04168)
Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions
This paper considers a scenario in city navigation: an AI agent is provided
with language descriptions of the goal location with respect to some well-known
landmarks. By observing only the surrounding scene, including recognizing
landmarks and road network connections, the agent must decide how to navigate
to the goal location without instructions. This problem is very challenging
because it requires the agent to localize itself and acquire a spatial
representation of a complex urban environment, where landmarks are often
not visible. In the absence of navigation instructions, such abilities are vital
for the agent to make high-quality decisions in long-range city navigation.
With the emergent reasoning ability of large language models (LLMs), a tempting
baseline is to prompt LLMs to "react" to each observation and make decisions
accordingly. However, this baseline performs very poorly: the agent often
revisits the same locations and makes short-sighted, inconsistent decisions.
To address these issues, this paper introduces a novel agentic workflow
characterized by its abilities to perceive, reflect, and plan. Specifically,
we find LLaVA-7B can be fine-tuned to perceive the direction and distance of
landmarks with sufficient accuracy for city navigation. Moreover, reflection is
achieved through a memory mechanism, in which past experiences are stored and
can be retrieved using the current perception to support effective decision making.
Planning uses reflection results to produce long-term plans, which can avoid
short-sighted decisions in long-range navigation. We show that the designed
workflow significantly improves the navigation ability of the LLM agent
compared with state-of-the-art baselines.
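
The perceive, reflect, and plan loop described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the grid world, the `perceive` stub (standing in for the fine-tuned LLaVA-7B perception of landmark direction and distance), the `Memory` class, and the `plan` and `navigate` functions are all hypothetical names chosen for illustration. The sketch assumes the goal is described as an offset from a single landmark, and that the only "experience" worth remembering is a move that failed at a given perceived position.

```python
import math
from dataclasses import dataclass, field

# Hypothetical toy world: positions are grid cells, moves are unit steps.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def perceive(position, landmark):
    """Stub for perception: (direction in radians, distance) to the landmark."""
    dx, dy = landmark[0] - position[0], landmark[1] - position[1]
    return math.atan2(dy, dx), math.hypot(dx, dy)

def offset_from(direction, distance):
    """Convert a perceived direction and distance back to a grid offset."""
    return round(distance * math.cos(direction)), round(distance * math.sin(direction))

@dataclass
class Memory:
    """Stores past experiences, keyed by what the agent perceived at the time."""
    failed: set = field(default_factory=set)  # (offset_to_landmark, move) pairs

    def store_failure(self, key, move):
        self.failed.add((key, move))

    def reflect(self, key):
        """Retrieve the moves known to fail under the current perception."""
        return {move for (k, move) in self.failed if k == key}

def plan(to_goal, avoid):
    """Produce a full move sequence to the goal, not just the next step."""
    dx, dy = to_goal
    x_moves = ["E" if dx > 0 else "W"] * abs(dx)
    y_moves = ["N" if dy > 0 else "S"] * abs(dy)
    for candidate in (x_moves + y_moves, y_moves + x_moves):
        if candidate and candidate[0] not in avoid:
            return candidate
    return []  # no direct option left; this toy agent gives up

def navigate(start, landmark, goal_offset, blocked, max_steps=50):
    """Run the loop; `blocked` holds (position, move) pairs the world rejects."""
    position, memory, path, current_plan = start, Memory(), [start], []
    for _ in range(max_steps):
        # Perceive: the agent only knows where the landmark is relative to itself.
        to_landmark = offset_from(*perceive(position, landmark))
        to_goal = (to_landmark[0] + goal_offset[0], to_landmark[1] + goal_offset[1])
        if to_goal == (0, 0):
            return path
        if not current_plan:
            # Reflect, then plan with what memory says to avoid here.
            current_plan = plan(to_goal, memory.reflect(to_landmark))
            if not current_plan:
                return path
        move = current_plan.pop(0)
        if (position, move) in blocked:
            memory.store_failure(to_landmark, move)  # remember, then replan
            current_plan = []
            continue
        position = (position[0] + MOVES[move][0], position[1] + MOVES[move][1])
        path.append(position)
    return path

path = navigate(start=(0, 0), landmark=(3, 3), goal_offset=(1, -1),
                blocked={((2, 0), "E")})
print(path)
```

In this run the agent first plans to go east four times and then north twice, hits the blocked road at (2, 0), records the failure, and replans with the northward leg first, reaching the goal at (4, 2) without retrying the blocked move. A purely reactive agent with no memory would have no record that the eastward move failed, which is the kind of repeated, short-sighted behavior the abstract attributes to the "react" baseline.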