Evaluating and Advancing Large Language Models for Water Knowledge Tasks in Engineering and Research

IF 8.9 · CAS Tier 2, Environmental Science & Ecology · JCR Q1, Engineering, Environmental
Boyan Xu, Zihao Li, Yuxin Yang, Guanlan Wu, Chengzhi Wang, Xiongpeng Tang, Yu Li, Zihao Wu, Qingxian Su, Xueqing Shi, Yue Yang, Rui Tong, Liang Wen* and How Yong Ng*
{"title":"Evaluating and Advancing Large Language Models for Water Knowledge Tasks in Engineering and Research","authors":"Boyan Xu,&nbsp;Zihao Li,&nbsp;Yuxin Yang,&nbsp;Guanlan Wu,&nbsp;Chengzhi Wang,&nbsp;Xiongpeng Tang,&nbsp;Yu Li,&nbsp;Zihao Wu,&nbsp;Qingxian Su,&nbsp;Xueqing Shi,&nbsp;Yue Yang,&nbsp;Rui Tong,&nbsp;Liang Wen* and How Yong Ng*,&nbsp;","doi":"10.1021/acs.estlett.5c0003810.1021/acs.estlett.5c00038","DOIUrl":null,"url":null,"abstract":"<p >Although large language models (LLMs) have demonstrated significant value in numerous fields, there remains limited research on evaluating their performance or enhancing their capabilities within water science and technology. This study initially assessed the performance of eight foundational models (i.e., GPT-4, GPT-3.5, Gemini, GLM-4, ERNIE, QWEN, Llama3-8B, and Llama3-70B) on a wide range of water knowledge tasks in engineering and research by developing an evaluation suite called WaterER (i.e., 1043 tasks). GPT-4 was demonstrated to excel in diverse water knowledge tasks in engineering and research. Llama3-70B was best for Chinese engineering queries, while Chinese-oriented models outperformed GPT-3.5 in English engineering tasks. Gemini demonstrated specialized academic capabilities in wastewater treatment, environmental restoration, drinking water treatment, sanitation, anaerobic digestion, and contaminants. To further advance LLMs, we employed prompt engineering (i.e., five-shot learning) and fine-tuned open-sourced Llama3-8B into a specialized model, namely, WaterGPT. WaterGPT exhibited enhanced reasoning capabilities, outperforming Llama3-8B by over 135.4% on English engineering tasks and 18.8% on research tasks. Additionally, fine-tuning proved to be more reliable and effective than prompt engineering. Collectively, this study established various LLMs’ baseline performance in water sectors while highlighting the robust evaluation frameworks and augmentation techniques to ensure the effective and reliable use of LLMs.</p>","PeriodicalId":37,"journal":{"name":"Environmental Science & Technology Letters Environ.","volume":"12 3","pages":"289–296 289–296"},"PeriodicalIF":8.9000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Environmental Science & Technology Letters Environ.","FirstCategoryId":"1","ListUrlMain":"https://pubs.acs.org/doi/10.1021/acs.estlett.5c00038","RegionNum":2,"RegionCategory":"环境科学与生态学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ENVIRONMENTAL","Score":null,"Total":0}
Citations: 0

Abstract

Although large language models (LLMs) have demonstrated significant value in numerous fields, there remains limited research on evaluating their performance or enhancing their capabilities within water science and technology. This study first assessed the performance of eight foundational models (i.e., GPT-4, GPT-3.5, Gemini, GLM-4, ERNIE, QWEN, Llama3-8B, and Llama3-70B) on a wide range of water knowledge tasks in engineering and research by developing an evaluation suite called WaterER (1043 tasks). GPT-4 was demonstrated to excel across diverse water knowledge tasks in engineering and research. Llama3-70B was best for Chinese engineering queries, while Chinese-oriented models outperformed GPT-3.5 on English engineering tasks. Gemini demonstrated specialized academic capabilities in wastewater treatment, environmental restoration, drinking water treatment, sanitation, anaerobic digestion, and contaminants. To further advance LLMs, we employed prompt engineering (i.e., five-shot learning) and fine-tuned the open-source Llama3-8B into a specialized model, namely WaterGPT. WaterGPT exhibited enhanced reasoning capabilities, outperforming Llama3-8B by over 135.4% on English engineering tasks and 18.8% on research tasks. Additionally, fine-tuning proved to be more reliable and effective than prompt engineering. Collectively, this study established various LLMs’ baseline performance in water sectors while highlighting the robust evaluation frameworks and augmentation techniques needed to ensure the effective and reliable use of LLMs.
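This listing does not include the study's code, so the block below is only a minimal sketch of how a WaterER-style evaluation with five-shot prompting could be assembled. The task file `water_er.json`, its field names, the exemplar set, and the exact-match scoring rule are illustrative assumptions, not the authors' released benchmark.

```python
# Minimal sketch of a WaterER-style evaluation with five-shot prompting.
# File names, field names, and the scoring rule are assumptions for
# illustration; they are not taken from the paper.
import json

from openai import OpenAI  # any OpenAI-compatible chat client works here

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_five_shot_prompt(exemplars, question):
    """Prepend five worked examples to the target question (few-shot prompting)."""
    parts = ["Answer the following water-engineering question with a single option letter."]
    for ex in exemplars[:5]:
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def evaluate(tasks, exemplars, model="gpt-4"):
    """Return simple accuracy of `model` over a list of {'question', 'answer'} items."""
    correct = 0
    for task in tasks:
        prompt = build_five_shot_prompt(exemplars, task["question"])
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # deterministic decoding for benchmarking
        )
        prediction = reply.choices[0].message.content.strip()
        correct += prediction.startswith(task["answer"])
    return correct / len(tasks)


if __name__ == "__main__":
    # 'water_er.json' is a hypothetical local export of the 1043-item suite.
    with open("water_er.json") as f:
        data = json.load(f)
    print(evaluate(data["tasks"], data["exemplars"]))
```

The same loop can be pointed at different `model` identifiers to reproduce the kind of cross-model comparison the abstract describes, with accuracy as the simplest possible score.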

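The abstract states that WaterGPT was produced by fine-tuning Llama3-8B but gives no training details, so the sketch below shows one common recipe: supervised fine-tuning with LoRA adapters via Hugging Face `datasets`, `peft`, and `trl`. The dataset file `water_sft.jsonl`, the hyperparameters, and the choice of LoRA are assumptions, not the authors' pipeline.

```python
# Sketch of one way to fine-tune Llama3-8B into a domain model; the dataset
# path and hyperparameters are illustrative, not taken from the paper.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file whose "text" column holds instruction-response pairs.
dataset = load_dataset("json", data_files="water_sft.jsonl", split="train")

peft_config = LoraConfig(
    r=16,                 # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="watergpt-8b-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # base checkpoint to adapt
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
)
trainer.train()
trainer.save_model()  # writes the LoRA adapter next to the tokenizer files
```

The trained adapter can later be merged into the base weights for deployment and then scored with the same evaluation loop used for the foundational models.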

Source Journal

Environmental Science & Technology Letters
CiteScore: 17.90
Self-citation rate: 3.70%
Articles published: 163
About the journal: Environmental Science & Technology Letters serves as an international forum for brief communications on experimental or theoretical results of exceptional timeliness in all aspects of environmental science, both pure and applied. Published as soon as accepted, these communications are summarized in monthly issues. Additionally, the journal features short reviews on emerging topics in environmental science and technology.