Evaluating and Advancing Large Language Models for Water Knowledge Tasks in Engineering and Research

Impact Factor 8.9 · CAS Zone 2 (Environmental Science & Ecology) · JCR Q1 (Engineering, Environmental)
Boyan Xu, Zihao Li, Yuxin Yang, Guanlan Wu, Chengzhi Wang, Xiongpeng Tang, Yu Li, Zihao Wu, Qingxian Su, Xueqing Shi, Yue Yang, Rui Tong, Liang Wen* and How Yong Ng*
{"title":"Evaluating and Advancing Large Language Models for Water Knowledge Tasks in Engineering and Research","authors":"Boyan Xu,&nbsp;Zihao Li,&nbsp;Yuxin Yang,&nbsp;Guanlan Wu,&nbsp;Chengzhi Wang,&nbsp;Xiongpeng Tang,&nbsp;Yu Li,&nbsp;Zihao Wu,&nbsp;Qingxian Su,&nbsp;Xueqing Shi,&nbsp;Yue Yang,&nbsp;Rui Tong,&nbsp;Liang Wen* and How Yong Ng*,&nbsp;","doi":"10.1021/acs.estlett.5c0003810.1021/acs.estlett.5c00038","DOIUrl":null,"url":null,"abstract":"<p >Although large language models (LLMs) have demonstrated significant value in numerous fields, there remains limited research on evaluating their performance or enhancing their capabilities within water science and technology. This study initially assessed the performance of eight foundational models (i.e., GPT-4, GPT-3.5, Gemini, GLM-4, ERNIE, QWEN, Llama3-8B, and Llama3-70B) on a wide range of water knowledge tasks in engineering and research by developing an evaluation suite called WaterER (i.e., 1043 tasks). GPT-4 was demonstrated to excel in diverse water knowledge tasks in engineering and research. Llama3-70B was best for Chinese engineering queries, while Chinese-oriented models outperformed GPT-3.5 in English engineering tasks. Gemini demonstrated specialized academic capabilities in wastewater treatment, environmental restoration, drinking water treatment, sanitation, anaerobic digestion, and contaminants. To further advance LLMs, we employed prompt engineering (i.e., five-shot learning) and fine-tuned open-sourced Llama3-8B into a specialized model, namely, WaterGPT. WaterGPT exhibited enhanced reasoning capabilities, outperforming Llama3-8B by over 135.4% on English engineering tasks and 18.8% on research tasks. Additionally, fine-tuning proved to be more reliable and effective than prompt engineering. Collectively, this study established various LLMs’ baseline performance in water sectors while highlighting the robust evaluation frameworks and augmentation techniques to ensure the effective and reliable use of LLMs.</p>","PeriodicalId":37,"journal":{"name":"Environmental Science & Technology Letters Environ.","volume":"12 3","pages":"289–296 289–296"},"PeriodicalIF":8.9000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Environmental Science & Technology Letters Environ.","FirstCategoryId":"1","ListUrlMain":"https://pubs.acs.org/doi/10.1021/acs.estlett.5c00038","RegionNum":2,"RegionCategory":"环境科学与生态学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ENVIRONMENTAL","Score":null,"Total":0}
Citations: 0

Abstract

Although large language models (LLMs) have demonstrated significant value in numerous fields, there remains limited research on evaluating their performance or enhancing their capabilities within water science and technology. This study first assessed the performance of eight foundational models (GPT-4, GPT-3.5, Gemini, GLM-4, ERNIE, QWEN, Llama3-8B, and Llama3-70B) on a wide range of water knowledge tasks in engineering and research by developing an evaluation suite, WaterER, comprising 1043 tasks. GPT-4 excelled across diverse water knowledge tasks in engineering and research. Llama3-70B performed best on Chinese engineering queries, while the Chinese-oriented models outperformed GPT-3.5 on English engineering tasks. Gemini demonstrated specialized academic capabilities in wastewater treatment, environmental restoration, drinking water treatment, sanitation, anaerobic digestion, and contaminants. To further advance LLMs, we employed prompt engineering (five-shot learning) and fine-tuned the open-source Llama3-8B into a specialized model, WaterGPT. WaterGPT exhibited enhanced reasoning capabilities, outperforming Llama3-8B by over 135.4% on English engineering tasks and 18.8% on research tasks. Fine-tuning also proved more reliable and effective than prompt engineering. Collectively, this study established the baseline performance of various LLMs in the water sector while highlighting the robust evaluation frameworks and augmentation techniques needed to ensure their effective and reliable use.
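The abstract names two augmentation routes, prompt engineering via five-shot learning and fine-tuning Llama3-8B into WaterGPT, but this page does not reproduce the authors' prompts, task set, or training code. The sketch below is only a minimal illustration of how a five-shot prompt for a water-knowledge query could be assembled and sent to a chat model; the example Q/A pairs, system message, model name, and choice of the OpenAI Python client are assumptions of this sketch, not the paper's implementation.

```python
"""Minimal sketch of a five-shot prompting setup for a water-knowledge query.

Assumptions (not from the paper): the example Q/A pairs, the system message,
the model name, and the use of the OpenAI Python client are illustrative only;
the actual WaterER tasks and evaluation code are not reproduced here.
"""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Five worked demonstrations ("shots") prepended to every query.
# Placeholder content only -- not items from the actual WaterER suite.
FEW_SHOT_EXAMPLES = [
    ("Which biological process converts nitrate to nitrogen gas in wastewater treatment?",
     "Denitrification."),
    ("What pH range is generally considered optimal for stable anaerobic digestion?",
     "Roughly 6.8 to 7.2."),
    ("Name a common coagulant used in drinking water treatment.",
     "Aluminum sulfate (alum)."),
    ("Which parameter measures the oxygen consumed by microbial degradation of organics over five days?",
     "Five-day biochemical oxygen demand (BOD5)."),
    ("What membrane process is typically used for seawater desalination?",
     "Reverse osmosis."),
]


def build_five_shot_messages(question: str) -> list[dict]:
    """Assemble a chat prompt: system role, five Q/A demonstrations, then the new query."""
    messages = [{"role": "system",
                 "content": "You are an expert in water science, engineering, and research."}]
    for q, a in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages


def ask(question: str, model: str = "gpt-4") -> str:
    """Send the five-shot prompt and return the model's answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_five_shot_messages(question),
        temperature=0.0,  # deterministic output keeps scoring repeatable
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask("What is the main advantage of anaerobic digestion over aerobic treatment for sludge stabilization?"))
```

Fine-tuning WaterGPT itself would instead involve supervised training of Llama3-8B on curated water-domain data, which cannot be reconstructed from the abstract alone.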


Source Journal

Environmental Science & Technology Letters (Engineering, Environmental; Environmental Sciences)

CiteScore: 17.90
Self-citation rate: 3.70%
Articles per year: 163

Journal description: Environmental Science & Technology Letters serves as an international forum for brief communications on experimental or theoretical results of exceptional timeliness in all aspects of environmental science, both pure and applied. Published as soon as accepted, these communications are summarized in monthly issues. Additionally, the journal features short reviews on emerging topics in environmental science and technology.