Xukai Zhao, He Huang, Tao Yang, Yuxing Lu, Lu Zhang, Ruoyu Wang, Zhengliang Liu, Tianyang Zhong, Tianming Liu
Urban planning in the age of large language models: Assessing OpenAI o1's performance and capabilities across 556 tasks
Computers, Environment and Urban Systems, Volume 121, Article 102332 (published 2025-08-01)
DOI: 10.1016/j.compenvurbsys.2025.102332
URL: https://www.sciencedirect.com/science/article/pii/S0198971525000857
Impact factor 8.3, JCR Q1 (Environmental Studies)
Citations: 0
Abstract
Integrating Large Language Models (LLMs) into urban planning presents significant opportunities to enhance efficiency and support data-driven city development strategies. Despite their potential, the specific capabilities of LLMs within the urban planning context remain underexplored, and the field lacks standardized benchmarks for systematic evaluation. This study presents the first comprehensive evaluation focused on OpenAI o1's performance and capabilities in urban planning, systematically benchmarking it against GPT-3.5 and GPT-4o using an original open-source benchmark comprising 556 tasks across five critical categories: urban planning documentation, examinations, routine data analysis, AI algorithm support, and thesis writing. Through rigorous testing and manual analysis of 170,627 words of generated output, OpenAI o1 consistently outperformed its counterparts, achieving an average performance score of 84.08% compared to 69.30% for GPT-4o and 45.27% for GPT-3.5. Our findings highlight o1's strengths in domain knowledge mastery, basic operational competence, and coding capabilities, demonstrating its potential applications in information retrieval, urban data analytics, planning decision support, educational assistance, and LLM-based agent development. However, significant limitations were identified, including an inability to perform urban design tasks, susceptibility to fabricating information, moderate academic writing quality, difficulties with high-level professional examinations and spatial reasoning, and limited support for specialized or emerging AI algorithms. Future optimizations should prioritize enhancing multimodal integration, implementing robust validation mechanisms, adopting adaptive learning strategies, and enabling domain-specific fine-tuning to meet urban planners' specialized needs. These advancements would enable LLMs to better support the evolving demands of urban planning, allowing professionals to focus more on strategic decision-making and the creative aspects of the field.
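For readers sketching a comparable evaluation, the aggregation behind a headline figure such as the 84.08% average performance score can be as simple as pooling per-task scores across the five benchmark categories. The snippet below is a minimal, hypothetical illustration: the category names mirror the abstract, but the sample scores, data structure, and function names are assumptions for demonstration, not the authors' actual scoring pipeline.

```python
# Hypothetical sketch: averaging per-task scores per model across benchmark categories.
# Category names follow the abstract; task counts and scores here are illustrative only.
from statistics import mean

results = {
    "o1": {
        "documentation": [0.90, 0.85],
        "examinations": [0.80, 0.75],
        "data_analysis": [0.95, 0.90],
        "algorithm_support": [0.85],
        "thesis_writing": [0.70, 0.80],
    },
    # GPT-4o and GPT-3.5 would be scored against the same 556 tasks in the same way.
}

def average_score(per_category: dict[str, list[float]]) -> float:
    """Mean score over all tasks, pooled across categories."""
    all_scores = [score for scores in per_category.values() for score in scores]
    return mean(all_scores)

for model, per_category in results.items():
    print(f"{model}: {average_score(per_category):.2%}")
```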
About the journal:
Computers, Environment and Urban Systems is an interdisciplinary journal publishing cutting-edge and innovative computer-based research on environmental and urban systems that privileges the geospatial perspective. The journal welcomes original, high-quality scholarship of a theoretical, applied, or technological nature, and provides a stimulating presentation of perspectives, research developments, overviews of important new technologies, and uses of major computational, information-based, and visualization innovations. Applied and theoretical contributions demonstrate the scope of computer-based analysis fostering a better understanding of environmental and urban systems, their spatial scope, and their dynamics.