Geo-FuB: A method for constructing an Operator-Function knowledge base for geospatial code generation with large language models

IF 7.2 · Zone 1 (Computer Science) · Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Knowledge-Based Systems · Pub Date: 2025-04-24 · DOI: 10.1016/j.knosys.2025.113624

Shuyang Hou, Anqi Zhao, Jianyuan Liang, Zhangxiao Shen, Huayi Wu

Knowledge-Based Systems, Volume 319, Article 113624. Full text: https://www.sciencedirect.com/science/article/pii/S0950705125006707

Citations: 0

Abstract

The rapid growth of spatiotemporal data and the rising demand for geospatial modeling have motivated the use of large language models (LLMs) to automate these tasks and improve research efficiency. However, general-purpose LLMs often hallucinate when generating geospatial code because they lack domain-specific knowledge of geospatial functions and related operators. Retrieval-augmented generation (RAG), integrated with an external operator-function knowledge base, offers an effective solution to this problem, yet to date no widely recognized framework exists for building such a knowledge base. This study presents a comprehensive framework for constructing the operator-function knowledge base that leverages the semantic and structural knowledge embedded in geospatial scripts. The framework consists of three core components: Function Semantic Framework Construction (Geo-FuSE), Frequent Operator Combination Statistics (Geo-FuST), and Combination and Semantic Framework Mapping (Geo-FuM). Geo-FuSE employs techniques such as Chain-of-Thought (CoT) prompting, TF-IDF, t-SNE, and Gaussian Mixture Models (GMM) to extract semantic features from scripts; Geo-FuST uses Abstract Syntax Trees (AST) and the Apriori algorithm to identify frequent operator combinations; Geo-FuM combines LLMs with a fuzzy matching algorithm to align these combinations with the semantic framework, forming the Geo-FuB knowledge base. An instance of Geo-FuB, named GEE-FuB, was built from 154,075 Google Earth Engine scripts and is available at https://github.com/whuhsy/GEE-FuB. Under a set of evaluation metrics introduced in this study, the GEE-FuB construction achieved an overall accuracy of 88.89% and reduced hallucinations by 31% to 34% compared with mainstream LLMs without external knowledge integration. This research introduces a novel approach to knowledge mining and knowledge base construction tailored to geospatial code generation, broadening the applications of knowledge base construction and providing theoretical insights, practical examples, and data resources for related research fields.
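To make the Geo-FuST step more concrete, the following is a minimal sketch of how operator combinations might be mined from scripts with an AST pass followed by level-wise Apriori counting. It is not the authors' implementation. As simplifying assumptions, it parses Python-style Earth Engine snippets with Python's standard `ast` module (Google Earth Engine scripts are usually written in JavaScript, and the abstract does not say which parser was used), and the helper names `extract_operators` and `apriori`, the sample scripts, and the support threshold are all hypothetical.

```python
import ast
from collections import Counter
from itertools import combinations


def extract_operators(source):
    """Return the set of call names found in a script (hypothetical helper)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        parts = []
        func = node.func
        # Unwind attribute chains such as ee.Image or Map.addLayer.
        while isinstance(func, ast.Attribute):
            parts.append(func.attr)
            func = func.value
        if isinstance(func, ast.Name):
            parts.append(func.id)
            names.add(".".join(reversed(parts)))
        elif parts:
            # Method called on an expression result: keep only the method name.
            names.add(parts[0])
    return frozenset(names)


def apriori(transactions, min_support, max_size=3):
    """Level-wise frequent itemset mining; returns {itemset: support count}."""
    frequent = {}
    # Level 1 candidates are the individual operators.
    candidates = {frozenset([item]) for t in transactions for item in t}
    size = 1
    while candidates and size <= max_size:
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Illustrative scripts only; they are parsed, never executed.
scripts = [
    'img = ee.Image("A").normalizedDifference(["B5", "B4"])\nMap.addLayer(img)',
    'img = ee.Image("B").normalizedDifference(["B8", "B4"])\nMap.addLayer(img)',
    'col = ee.ImageCollection("C").filterDate("2020", "2021")\nMap.addLayer(col)',
]
transactions = [extract_operators(s) for s in scripts]
frequent = apriori(transactions, min_support=2)

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy corpus, operators that appear together in at least two scripts survive the support threshold, while those used only once are dropped. A real pipeline would likely tune the threshold to the corpus size and might also preserve call order, which this set-based sketch ignores.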




Source journal

Knowledge-Based Systems (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months

About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.