Evaluating and Advancing Large Language Models for Water Knowledge Tasks in Engineering and Research

Boyan Xu, Zihao Li, Yuxin Yang, Guanlan Wu, Chengzhi Wang, Xiongpeng Tang, Yu Li, Zihao Wu, Qingxian Su, Xueqing Shi, Yue Yang, Rui Tong, Liang Wen* and How Yong Ng*

Environmental Science & Technology Letters, Vol. 12, No. 3, pp. 289–296. Published online 2025-02-20.
DOI: 10.1021/acs.estlett.5c00038 (https://pubs.acs.org/doi/10.1021/acs.estlett.5c00038)
Citations: 0
Abstract
Although large language models (LLMs) have demonstrated significant value in numerous fields, there remains limited research on evaluating their performance or enhancing their capabilities within water science and technology. This study first assessed the performance of eight foundational models (GPT-4, GPT-3.5, Gemini, GLM-4, ERNIE, QWEN, Llama3-8B, and Llama3-70B) on a wide range of water knowledge tasks in engineering and research by developing an evaluation suite called WaterER (1043 tasks). GPT-4 excelled across diverse water knowledge tasks in engineering and research. Llama3-70B was best for Chinese engineering queries, while Chinese-oriented models outperformed GPT-3.5 in English engineering tasks. Gemini demonstrated specialized academic capabilities in wastewater treatment, environmental restoration, drinking water treatment, sanitation, anaerobic digestion, and contaminants. To further advance LLMs, we employed prompt engineering (i.e., five-shot learning) and fine-tuned the open-source Llama3-8B into a specialized model, namely, WaterGPT. WaterGPT exhibited enhanced reasoning capabilities, outperforming Llama3-8B by over 135.4% on English engineering tasks and 18.8% on research tasks. Additionally, fine-tuning proved to be more reliable and effective than prompt engineering. Collectively, this study established various LLMs' baseline performance in water sectors while highlighting the robust evaluation frameworks and augmentation techniques needed to ensure the effective and reliable use of LLMs.
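The five-shot prompting approach mentioned above can be illustrated with a minimal sketch: five worked question–answer pairs are prepended to the target question so the model can imitate the demonstrated format. This is a generic illustration, not the authors' WaterER prompt template; the function name and the example Q/A pairs below are hypothetical placeholders.

```python
# Minimal sketch of few-shot prompt construction. The Q/A pairs are
# invented placeholders, not items from the WaterER evaluation suite.
def build_few_shot_prompt(examples, question):
    """Join demonstration Q/A pairs and the target question into one prompt."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # model completes the final answer
    return "\n\n".join(parts)

# Five hypothetical demonstrations -> a "five-shot" prompt.
demo_pairs = [
    ("What process removes dissolved organics in activated sludge?",
     "Aerobic biodegradation by the microbial biomass."),
    ("What is the main purpose of a primary clarifier?",
     "Settling out suspended solids before biological treatment."),
    ("Why is dissolved oxygen monitored in aeration tanks?",
     "To keep aerobic microorganisms active while limiting energy use."),
    ("What does BOD measure?",
     "The oxygen demand of biodegradable organic matter in water."),
    ("What is the role of a disinfection step in drinking water treatment?",
     "Inactivating pathogenic microorganisms before distribution."),
]

prompt = build_few_shot_prompt(
    demo_pairs, "Which gas dominates biogas from anaerobic digestion?"
)
print(prompt)
```

The resulting string would be sent to a chat or completion endpoint of the model under test; fine-tuning, by contrast, bakes such domain knowledge into the weights, which is consistent with the study's finding that fine-tuning was the more reliable of the two techniques.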
About the Journal
Environmental Science & Technology Letters serves as an international forum for brief communications on experimental or theoretical results of exceptional timeliness in all aspects of environmental science, both pure and applied. Published as soon as accepted, these communications are summarized in monthly issues. Additionally, the journal features short reviews on emerging topics in environmental science and technology.