Xiaoyun Wang, Jing Su, Yue Liu, Yao Ji, Qiuling Dang, Yuanyuan Sun, Quanli Liu
Development of a rapid and cost-effective groundwater quality assessment model based on hybrid ensemble learning
Ecological Indicators, volume 178, Article 113894. Published 2025-07-19. DOI: 10.1016/j.ecolind.2025.113894
Impact factor: 7.0. JCR: Q1 (Environmental Sciences).
https://www.sciencedirect.com/science/article/pii/S1470160X25008246
Citations: 0
Abstract
The use of machine learning to assess groundwater quality and health risks is receiving widespread attention. However, assessment accuracy and cost-effectiveness are key factors in determining whether a model can be implemented. The main purpose of this study is therefore to develop a convenient, low-cost, and accurate hybrid ensemble model to predict the water quality index (WQI) and hazard index (HI). First, the Pearson correlation matrix and SHAP values were compared to select the optimal feature combination. Second, base learners were selected from 12 machine learning candidates. eXtreme Gradient Boosting (XGB) was then selected as the meta-learner to construct stacking and blending ensemble models, and the predictions of the base learners were averaged to obtain the predictions of an averaging ensemble model. Finally, evaluation metrics (R² and RMSE), the t-test, and probabilistic forecasting were integrated to assess model performance. The results show that TDS, HCO₃⁻, Mg²⁺, and SO₄²⁻ form the best feature combination for WQI prediction, and Na⁺, Ca²⁺, Mg²⁺, and HCO₃⁻ form the best feature combination for HI prediction. SHAP values performed better than the Pearson correlation matrix in reducing the number of input variables and improving model accuracy. The accuracy of the stacking ensemble model on the test/validation sets (average R² = 0.966/0.921 for WQI and 0.835/0.714 for HI) was significantly (p < 0.05) higher than that of the other models. The stacking ensemble model developed in this study supports governments in assessing groundwater quality and formulating rational policies. The integration of evaluation metrics and statistical analysis also offers new ideas for model evaluation in the environmental field.
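The averaging ensemble and the two evaluation metrics named in the abstract (R² and RMSE) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three base learners, and all numeric values below are hypothetical, and the stacking and blending models with an XGB meta-learner are not shown. It uses only the Python standard library.

```python
import math
from statistics import fmean


def average_ensemble(base_predictions):
    """Average the predictions of several base learners, sample by sample.

    base_predictions: list of equal-length lists, one list per base learner.
    """
    return [fmean(sample) for sample in zip(*base_predictions)]


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical WQI observations and base-learner outputs (illustrative only).
y_true = [50.0, 60.0, 70.0, 80.0]
base_predictions = [
    [48.0, 62.0, 69.0, 83.0],  # hypothetical base learner A
    [52.0, 60.0, 73.0, 79.0],  # hypothetical base learner B
    [50.0, 58.0, 71.0, 81.0],  # hypothetical base learner C
]

y_avg = average_ensemble(base_predictions)
print("averaged predictions:", y_avg)
print("RMSE:", rmse(y_true, y_avg))
print("R2:", r2(y_true, y_avg))
```

In a stacking model, by contrast, the base-learner predictions would be passed as input features to a meta-learner (XGB in the study) rather than averaged with equal weights.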
About the journal:
The ultimate aim of Ecological Indicators is to integrate the monitoring and assessment of ecological and environmental indicators with management practices. The journal provides a forum for the discussion of the applied scientific development and review of traditional indicator approaches as well as for theoretical, modelling and quantitative applications such as index development. Research into the following areas will be published.
• All aspects of ecological and environmental indicators and indices.
• New indicators, and new approaches and methods for indicator development, testing and use.
• Development and modelling of indices, e.g. application of indicator suites across multiple scales and resources.
• Analysis and research of resource, system- and scale-specific indicators.
• Methods for integration of social and other valuation metrics for the production of scientifically rigorous and politically-relevant assessments using indicator-based monitoring and assessment programs.
• How research indicators can be transformed into direct application for management purposes.
• Broader assessment objectives and methods, e.g. biodiversity, biological integrity, and sustainability, through the use of indicators.
• Resource-specific indicators such as landscape, agroecosystems, forests, wetlands, etc.