Geo-FuB: A method for constructing an Operator-Function knowledge base for geospatial code generation with large language models
Authors: Shuyang Hou, Anqi Zhao, Jianyuan Liang, Zhangxiao Shen, Huayi Wu
Journal: Knowledge-Based Systems, Volume 319, Article 113624
Published: 2025-04-24
DOI: 10.1016/j.knosys.2025.113624
URL: https://www.sciencedirect.com/science/article/pii/S0950705125006707
The rapid growth of spatiotemporal data and the increasing demand for geospatial modeling have driven efforts to automate these tasks with large language models (LLMs) to improve research efficiency. However, general-purpose LLMs often hallucinate when generating geospatial code because they lack domain-specific knowledge of geospatial functions and related operators. Retrieval-augmented generation (RAG), integrated with an external operator-function knowledge base, offers an effective solution to this challenge. To date, however, no widely recognized framework exists for building such a knowledge base. This study presents a comprehensive framework for constructing the operator-function knowledge base, leveraging semantic and structural knowledge embedded in geospatial scripts. The framework consists of three core components: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Combination and Semantic Framework Mapping (Geo-FuM). Geo-FuSE employs techniques such as Chain-of-Thought (CoT), TF-IDF, t-SNE, and Gaussian Mixture Models (GMM) to extract semantic features from scripts; Geo-FuST uses Abstract Syntax Trees (AST) and the Apriori algorithm to identify frequent operator combinations; Geo-FuM combines LLMs with a fuzzy matching algorithm to align these combinations with the semantic framework, forming the Geo-FuB knowledge base. An instance of Geo-FuB, named GEE-FuB, has been developed from 154,075 Google Earth Engine scripts and is available at https://github.com/whuhsy/GEE-FuB. Based on a set of well-defined evaluation metrics introduced in this study, the GEE-FuB construction achieved an overall accuracy of 88.89%, and it reduced hallucinations by 31% to 34% compared with mainstream LLMs without external knowledge integration. This research introduces a novel approach to knowledge mining and knowledge base construction specifically tailored for geospatial code generation tasks, broadening the applications of knowledge base construction and providing valuable theoretical insights, practical examples, and data resources for related research fields.
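The Geo-FuST step described in the abstract (AST parsing followed by Apriori) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the scripts are written against the Earth Engine Python API so that Python's standard `ast` module can parse them, whereas most Google Earth Engine scripts are JavaScript. The sample scripts, the helper names `call_name`, `extract_operators`, and `apriori`, and the support threshold are all hypothetical choices made for illustration.

```python
import ast
from collections import Counter
from itertools import combinations

# Hypothetical sample scripts in Earth Engine Python API style (assumption:
# the real corpus is mostly JavaScript and would need a JavaScript parser).
SCRIPTS = [
    "img = ee.ImageCollection('A').filterDate('2020', '2021').median()",
    "col = ee.ImageCollection('B').filterDate('2019', '2020')",
    "img = ee.Image('C').normalizedDifference(['B5', 'B4'])",
]


def call_name(func: ast.expr) -> str:
    """Build a dotted name for a call target, e.g. 'ee.Image'.

    For a method chained onto another call, only the attribute chain
    after that call is kept (e.g. 'filterDate').
    """
    parts = []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return ".".join(reversed(parts))


def extract_operators(source: str) -> set:
    """Return the set of operator names called anywhere in one script."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = call_name(node.func)
            if name:
                names.add(name)
    return names


def apriori(transactions: list, min_support: float) -> dict:
    """Level-wise Apriori: map each frequent itemset to its support."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    frequent = {s: counts[next(iter(s))] / n for s in current}
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = set()
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                current.add(c)
                frequent[c] = support
        k += 1
    return frequent


if __name__ == "__main__":
    transactions = [extract_operators(s) for s in SCRIPTS]
    for itemset, support in apriori(transactions, min_support=0.6).items():
        print(sorted(itemset), round(support, 2))
```

Each script is treated as one transaction whose items are the operators it calls, so a frequent itemset corresponds to a combination of operators that co-occurs across many scripts. The sketch ignores call order and argument values, which a fuller treatment would likely need to consider.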
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.