Improving fecal bacteria estimation using machine learning and explainable AI in four major rivers, South Korea
Authors: SungMin Suh, JunGi Moon, Sangjin Jung, JongCheol Pyo
Journal: Science of the Total Environment, article 177459 (impact factor 8.2; JCR Q1, Environmental Sciences)
Published: 2024-12-20 (Epub 2024-11-19)
DOI: 10.1016/j.scitotenv.2024.177459 (https://doi.org/10.1016/j.scitotenv.2024.177459)
Citations: 0
Abstract
This study addresses the critical public health issue of fecal coliform contamination in the four major rivers of South Korea (the Han, Nakdong, Geum, and Yeongsan rivers) by applying machine learning (ML) algorithms combined with explainable artificial intelligence to improve both prediction accuracy and interpretability. Both traditional and ML models often struggle to estimate fecal coliform levels accurately because of the complexity of environmental variables and data limitations. To address these limitations, we employed two tree-based models (random forest [RF] and extreme gradient boosting [XGBoost]) and two neural network models (a deep neural network and a convolutional neural network [CNN]). We also used Shapley Additive Explanations (SHAP) to better understand the influence of each variable on the models' predictions. Using a dataset collected from the National Institute of Environmental Research that covers 16 water quality parameters and meteorological data from 2014 to 2022, our study improved the accuracy of fecal coliform estimation with the XGBoost and CNN models. The best result was obtained with XGBoost, which achieved a validation Nash-Sutcliffe efficiency of 0.597 in the Han River. In addition, this study uses SHAP to provide insights into the significant factors influencing fecal coliform concentrations across different river environments. The results indicated that the XGBoost model provided superior estimation accuracy and explanations for the contributions of variables. The SHAP results quantified the contribution of each water quality variable to the fecal coliform estimates of the XGBoost model. The study improves understanding of the relationship between water quality variables and fecal coliform contamination mechanisms in the four major rivers of South Korea.
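For readers unfamiliar with the Nash-Sutcliffe efficiency (NSE) reported in the abstract, the following is a minimal sketch of the standard definition, NSE = 1 - Σ(obs - sim)² / Σ(obs - mean(obs))². The function name and the sample values are hypothetical and chosen only for illustration; they are not the authors' code or data, and the abstract does not say whether concentrations were transformed before scoring.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """Return NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("observed and simulated must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    # Squared error of the model against the observations.
    residual_ss = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Squared error of the "always predict the observed mean" baseline.
    total_ss = sum((o - mean_obs) ** 2 for o in observed)
    if total_ss == 0:
        raise ValueError("NSE is undefined when observations have zero variance")
    return 1.0 - residual_ss / total_ss


# Hypothetical values for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
perfect = nash_sutcliffe_efficiency(observed, observed)             # exact match
mean_only = nash_sutcliffe_efficiency(observed, [2.5, 2.5, 2.5, 2.5])  # mean baseline
```

Under this definition a perfect model scores 1 and a model no better than the observed mean scores 0, so the reported validation value of 0.597 lies between those two reference points.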
About the journal:
The Science of the Total Environment is an international journal dedicated to scientific research on the environment and its interaction with humanity. It covers a wide range of disciplines and seeks to publish innovative, hypothesis-driven, and impactful research that explores the entire environment, including the atmosphere, lithosphere, hydrosphere, biosphere, and anthroposphere.
The journal's updated Aims & Scope emphasizes the importance of interdisciplinary environmental research with broad impact. Priority is given to studies that advance fundamental understanding and explore the interconnectedness of multiple environmental spheres. Field studies are preferred, while laboratory experiments must demonstrate significant methodological advancements or mechanistic insights with direct relevance to the environment.