Yifan Zhang , Zhiyun Wang , Zhengting He , Jingxuan Li , Gengchen Mai , Jianfeng Lin , Cheng Wei , Wenhao Yu
{"title":"BB-GeoGPT:地理信息科学大语言模型学习框架","authors":"Yifan Zhang , Zhiyun Wang , Zhengting He , Jingxuan Li , Gengchen Mai , Jianfeng Lin , Cheng Wei , Wenhao Yu","doi":"10.1016/j.ipm.2024.103808","DOIUrl":null,"url":null,"abstract":"<div><p>Large language models (LLMs) exhibit impressive capabilities across diverse tasks in natural language processing. Nevertheless, challenges arise such as large model parameter size and limited model accessibility through APIs such as ChatGPT and GPT-4, which prohibits the model deployment on mobile devices and domain adaptation or fine-tuning. Moreover, while LLMs excel in general domains, their performance in specialized fields such as GIS may not always align with the expectations of domain experts. This is primarily attributed to the diverse disciplinary origins of the training data, which often lack comprehensive coverage and treatment of knowledge specific to individual disciplines (e.g., GIS). Therefore, there is a crucial need to train and adapt LLMs specifically designed for different professional fields. In this paper, our focus is on the GIS domain, where we introduce BB(BaBy)-GeoGPT, a large language model with GIS-specific knowledge. To achieve this goal, we curated a comprehensive set of resources, comprising model pretraining data (BB-GeoPT, 26,907 documents), supervised fine-tuning data (BB-GeoSFT, 35,876 instructions), and evaluation data (BB-GeoEval, 600 objective questions and 150 subjective questions). BB-GeoGPT is developed by first adapting an open-source general-domain LLM, the LLaMA-2-7B model, to our pretraining data. Subsequently, we use instruction tuning to further fine-tune the model on our BB-GeoSFT. Through extensive experiments on the evaluation dataset, BB-GeoGPT demonstrates improvements ranging from 10.55% to 47.57% for objective questions and from 7.87% to 27.73% for subjective questions, when compared to general LLMs of similar size in terms of accuracy. Moreover, our data collection strategy and the amassed data can serve as a foundation for advancing LLM research in the GIS domain, fostering further development.</p></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":null,"pages":null},"PeriodicalIF":7.4000,"publicationDate":"2024-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BB-GeoGPT: A framework for learning a large language model for geographic information science\",\"authors\":\"Yifan Zhang , Zhiyun Wang , Zhengting He , Jingxuan Li , Gengchen Mai , Jianfeng Lin , Cheng Wei , Wenhao Yu\",\"doi\":\"10.1016/j.ipm.2024.103808\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Large language models (LLMs) exhibit impressive capabilities across diverse tasks in natural language processing. Nevertheless, challenges arise such as large model parameter size and limited model accessibility through APIs such as ChatGPT and GPT-4, which prohibits the model deployment on mobile devices and domain adaptation or fine-tuning. Moreover, while LLMs excel in general domains, their performance in specialized fields such as GIS may not always align with the expectations of domain experts. This is primarily attributed to the diverse disciplinary origins of the training data, which often lack comprehensive coverage and treatment of knowledge specific to individual disciplines (e.g., GIS). Therefore, there is a crucial need to train and adapt LLMs specifically designed for different professional fields. In this paper, our focus is on the GIS domain, where we introduce BB(BaBy)-GeoGPT, a large language model with GIS-specific knowledge. To achieve this goal, we curated a comprehensive set of resources, comprising model pretraining data (BB-GeoPT, 26,907 documents), supervised fine-tuning data (BB-GeoSFT, 35,876 instructions), and evaluation data (BB-GeoEval, 600 objective questions and 150 subjective questions). BB-GeoGPT is developed by first adapting an open-source general-domain LLM, the LLaMA-2-7B model, to our pretraining data. Subsequently, we use instruction tuning to further fine-tune the model on our BB-GeoSFT. Through extensive experiments on the evaluation dataset, BB-GeoGPT demonstrates improvements ranging from 10.55% to 47.57% for objective questions and from 7.87% to 27.73% for subjective questions, when compared to general LLMs of similar size in terms of accuracy. Moreover, our data collection strategy and the amassed data can serve as a foundation for advancing LLM research in the GIS domain, fostering further development.</p></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-06-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324001675\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324001675","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
BB-GeoGPT: A framework for learning a large language model for geographic information science
Large language models (LLMs) exhibit impressive capabilities across diverse tasks in natural language processing. Nevertheless, challenges arise such as large model parameter size and limited model accessibility through APIs such as ChatGPT and GPT-4, which prohibits the model deployment on mobile devices and domain adaptation or fine-tuning. Moreover, while LLMs excel in general domains, their performance in specialized fields such as GIS may not always align with the expectations of domain experts. This is primarily attributed to the diverse disciplinary origins of the training data, which often lack comprehensive coverage and treatment of knowledge specific to individual disciplines (e.g., GIS). Therefore, there is a crucial need to train and adapt LLMs specifically designed for different professional fields. In this paper, our focus is on the GIS domain, where we introduce BB(BaBy)-GeoGPT, a large language model with GIS-specific knowledge. To achieve this goal, we curated a comprehensive set of resources, comprising model pretraining data (BB-GeoPT, 26,907 documents), supervised fine-tuning data (BB-GeoSFT, 35,876 instructions), and evaluation data (BB-GeoEval, 600 objective questions and 150 subjective questions). BB-GeoGPT is developed by first adapting an open-source general-domain LLM, the LLaMA-2-7B model, to our pretraining data. Subsequently, we use instruction tuning to further fine-tune the model on our BB-GeoSFT. Through extensive experiments on the evaluation dataset, BB-GeoGPT demonstrates improvements ranging from 10.55% to 47.57% for objective questions and from 7.87% to 27.73% for subjective questions, when compared to general LLMs of similar size in terms of accuracy. Moreover, our data collection strategy and the amassed data can serve as a foundation for advancing LLM research in the GIS domain, fostering further development.
期刊介绍:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.