Streamlining geoscience data analysis with an LLM-driven workflow

IF 3.2 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Applied Computing and Geosciences Pub Date : 2025-02-01 DOI:10.1016/j.acags.2024.100218

Jiyin Zhang, Cory Clairmont, Xiang Que, Wenjia Li, Weilin Chen, Chenhao Li, Xiaogang Ma

Volume 25, Article 100218. Journal Article. https://www.sciencedirect.com/science/article/pii/S259019742400065X


Abstract

Large Language Models (LLMs) have made significant advances in natural language processing and human-like response generation. However, training and fine-tuning an LLM to meet the strict requirements of academic research in fields such as geoscience still requires significant computational resources and alignment by human experts to ensure the quality and reliability of the generated content. These challenges highlight the need for a more flexible and reliable LLM workflow for domain-specific analysis. This study proposes an LLM-driven workflow that addresses the challenges of using LLMs in geoscience data analysis. The work builds on the open data API (application programming interface) of Mindat, one of the largest databases in mineralogy. We designed and developed an open-source LLM-driven workflow that processes natural language requests and automatically uses the Mindat API, mineral co-occurrence network analysis, and locality distribution heat map visualization to carry out geoscience data analysis tasks. Using prompt engineering techniques, we developed a supervisor-based agentic framework that enables LLM agents not only to interpret context information but also to address complex geoscience analysis tasks autonomously, bridging the gap between automated workflows and human expertise. This agentic design emphasizes autonomy, allowing the workflow to adapt to future advances in LLM capabilities without additional fine-tuning or domain-specific embedding. By providing the workflow with comprehensive task context and professional tools, we ensure the quality of LLM-generated content without embedding geoscience knowledge into LLMs through fine-tuning or human alignment. Our approach integrates LLMs into geoscience data analysis, addressing the need for specialized tools while reducing the learning curve through LLM-driven interactions between users and APIs. This streamlined workflow enhances the efficiency of exploratory data analysis, as demonstrated by several use cases. In future work, we will explore the scalability of this workflow by integrating additional agents and diverse geoscience data sources.
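
To make the mineral co-occurrence network analysis mentioned in the abstract more concrete, the following is a minimal sketch of how edge weights for such a network could be counted. It is not the authors' implementation and does not call the Mindat API: the locality records are made-up sample data, and the function name cooccurrence_edges is hypothetical. It assumes only that each locality comes with a list of mineral names, and it weights each pair of minerals by the number of localities where both occur.

```python
from collections import Counter
from itertools import combinations

# Hypothetical locality records (made-up sample data, NOT Mindat output).
localities = {
    "locality_a": ["quartz", "pyrite", "galena"],
    "locality_b": ["quartz", "pyrite"],
    "locality_c": ["galena", "sphalerite", "quartz"],
}


def cooccurrence_edges(records):
    """Count how many localities each unordered mineral pair shares."""
    edges = Counter()
    for minerals in records.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(minerals)), 2):
            edges[pair] += 1
    return edges


edges = cooccurrence_edges(localities)

# List edges from the strongest co-occurrence to the weakest.
for (mineral_a, mineral_b), weight in edges.most_common():
    print(f"{mineral_a} -- {mineral_b}: {weight}")
```

In a workflow like the one described, a tool of this kind would presumably receive its records from an API query step rather than from a hard-coded dictionary, and the resulting edge list could be passed to a graph or plotting library for visualization.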


