Authors: Sangar Khan, Noël P.D. Juvigny-Khenafou, Tatenda Dalu, Paul J. Milham, Yasir Hamid, Kamel Mohamed Eltohamy, Habib Ullah, Bahman Jabbarian Amiri, Hao Chen, Naicheng Wu
Journal: Ecological Informatics, volume 91, Article 103355 (impact factor 7.3; JCR Q1, Ecology)
Published: 2025-08-11
DOI: 10.1016/j.ecoinf.2025.103355
URL: https://www.sciencedirect.com/science/article/pii/S1574954125003644
Citations: 0
The influence of pH and temperature on benthic chlorophyll-a: Insights from SHAP-XGBoost and random forest models
Biological threats to river health relate to algal biomass, for which benthic chlorophyll–a (chl–a) is an indicator; consequently, predicting chl–a helps in understanding ecosystem dynamics. Little information exists on machine learning models for predicting benthic chl–a, or on their input parameters, in lotic ecosystems. To fill this gap, we predict benthic chl–a levels in China's Thousand Islands Lake (TIL) watershed using machine learning algorithms. Water samples for nutrient and metal analysis were collected at 147 sites in the TIL catchment. We employed Random Forest (RF), eXtreme Gradient Boosting (XGBoost) and SHAP-enhanced XGBoost (SHAP XGBoost) models, alongside Support Vector Regression (SVR), to predict chl–a levels in diverse reaches and identify the key determinants. XGBoost outperformed the RF model on the training, validation and test datasets. In the SHAP XGBoost model, pH was the most important feature, followed by mean average temperature (AT). The SVR indicated that AT is vital in the upper and middle catchment reaches, while pH is more important in the lower reaches. In partial dependence plots, chl–a concentration depended strongly on pH and AT. High pH and AT released P from stream colloids and lowered colloid adsorption, thereby increasing chl–a concentration. We conclude that the SHAP XGBoost model can be used to identify the key determinants of chl–a from chemical and physical variables in the lotic system.
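The partial dependence analysis mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_model` is a hypothetical linear stand-in for a fitted regressor (it is not the study's XGBoost or RF model), the sample values are invented, and the helper name `partial_dependence` is ours. The sketch only shows the general idea: force one feature to each grid value for every sample, keep the other features as observed, and average the predictions.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def toy_model(row: Row) -> float:
    """Hypothetical stand-in for a fitted regressor (NOT the study's model)."""
    return 2.0 * row["pH"] + 0.5 * row["AT"]


def partial_dependence(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    feature: str,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite one feature for every sample; other features stay as observed.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(mean(preds))
    return curve


# Invented samples for illustration only (not data from the study).
samples = [
    {"pH": 7.0, "AT": 15.0},
    {"pH": 7.5, "AT": 18.0},
    {"pH": 8.0, "AT": 21.0},
]

pd_ph = partial_dependence(toy_model, samples, "pH", [6.5, 7.5, 8.5])
```

With a real fitted model, `predict` would wrap that model's prediction call; a steeply rising curve for a feature is what "depended strongly" refers to in a partial dependence plot.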
About the journal:
The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need for informing sustainable management in view of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.