The influence of pH and temperature on benthic chlorophyll-a: Insights from SHAP-XGBoost and random forest models
Authors: Sangar Khan, Noël P.D. Juvigny-Khenafou, Tatenda Dalu, Paul J. Milham, Yasir Hamid, Kamel Mohamed Eltohamy, Habib Ullah, Bahman Jabbarian Amiri, Hao Chen, Naicheng Wu
Journal: Ecological Informatics, Volume 91, Article 103355 (Impact Factor 7.3; JCR Q1, Ecology)
Published: 2025-08-11
Publication type: Journal Article
DOI: 10.1016/j.ecoinf.2025.103355
URL: https://www.sciencedirect.com/science/article/pii/S1574954125003644
Citations: 0
Abstract
Biological threats to river health relate to algal biomass, for which benthic chlorophyll–a (chl–a) is an indicator; consequently, predicting chl–a helps to understand ecosystem dynamics. Little information exists on machine learning models for predicting benthic chl–a, or on their input parameters, in lotic ecosystems. To fill this gap, we predicted benthic chl–a levels in China's Thousand Islands Lake (TIL) watershed using machine learning algorithms. Water samples for nutrient and metal analysis were collected at 147 sites in the TIL catchment. We employed Random Forest (RF), eXtreme Gradient Boosting (XGBoost) and SHAP-enhanced XGBoost (SHAP XGBoost) models, alongside Support Vector Regression (SVR), to predict chl–a levels in diverse reaches and identify the key determinants. XGBoost outperformed the RF model on the training, validation and test datasets. In the SHAP XGBoost model, pH was the most important feature, followed by mean average temperature (AT). The SVR indicated that AT is vital in the upper and middle catchment reaches, while pH is more important in the lower reaches. In partial dependence plots, the chl–a concentration depended strongly on pH and AT. High pH and AT released P from stream colloids and lowered colloid adsorption, thereby increasing chl–a concentration. We concluded that the SHAP XGBoost model could be used to identify the key determinants of chl–a from chemical and physical variables in the lotic system.
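The SHAP feature ranking mentioned in the abstract rests on Shapley values, which split a single prediction into additive per-feature contributions. The sketch below is only an illustration of that idea, not the authors' pipeline: `toy_model`, its coefficients, the `TP` (total phosphorus) feature, and the two sets of site values are all invented for this example, and the exact subset enumeration stands in for the tree-based approximation a SHAP library would apply to a fitted XGBoost model.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical stand-in for a fitted chl-a regressor (NOT the paper's model)."""
    ph, at, tp = x["pH"], x["AT"], x["TP"]
    # Invented coefficients, including a pH x AT interaction term.
    return 4.0 * ph + 0.3 * at + 0.1 * ph * at + 1.0 * tp


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition are held at their baseline value.
    """
    features = list(instance)
    n = len(features)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return model(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Invented values for illustration only; these are not the study's data.
baseline = {"pH": 7.0, "AT": 15.0, "TP": 0.05}  # hypothetical reference site
instance = {"pH": 8.5, "AT": 22.0, "TP": 0.08}  # hypothetical site to explain

phi = shapley_values(toy_model, instance, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the prediction for the explained site and the prediction for the baseline, which is the property that lets SHAP values be read as a decomposition of a model's output. Any ranking this toy produces reflects only the invented coefficients and says nothing about the study's results.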
About the journal:
The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need to inform sustainable management in view of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.