Interpreting machine learning models based on SHAP values in predicting suspended sediment concentration

IF 3.5, Zone 2 (Environmental Science and Ecology), JCR Q2 (Environmental Sciences)
Houda Lamane, Latifa Mouhir, Rachid Moussadek, Bouamar Baghdad, Ozgur Kisi, Ali El Bilali
DOI: 10.1016/j.ijsrc.2024.10.002
Journal: International Journal of Sediment Research, 40(1), pages 91-107
Publication date: 2025-02-01
Publication type: Journal Article
Source: https://www.sciencedirect.com/science/article/pii/S1001627924001070
Citations: 0

Abstract

Machine learning (ML) has become a powerful tool for predicting suspended sediment concentration (SSC). Nonetheless, the difficulty of interpreting the underlying physical process is considered the main obstacle to applying most ML approaches. In this regard, the current study presents a novel framework involving four standalone ML models (extra trees (ET), random forest (RF), categorical boosting (CatBoost), and extreme gradient boosting (XGBoost)) and their combination with genetic programming (GP). Three metrics (coefficient of correlation (r), root mean square error (RMSE), and Nash–Sutcliffe model-fit efficiency (NSE)) are used to assess the performance of these models on hydro-climatic datasets for SSC prediction, and a more advanced interpretation system, SHapley Additive exPlanations (SHAP), is used to interpret them. The models were calibrated on data from 2016 to 2020 and validated on 2021 data. The framework is further described and applied through a case study of the Bouregreg watershed. The results revealed that all implemented models are efficient in SSC prediction, with NSE ranging from 0.53 to 0.86, RMSE from 1.20 to 2.55 g/L, and r from 0.83 to 0.91. Box plots confirm the enhanced performance of the combined models. The best-performing models at the four hydrological stations are the combined RF + GP model at the Aguibat Ziar station, the combined XGBoost + GP model at the Ain Loudah station, the CatBoost model at the Ras Fathia station, and the RF model at the Sidi Med Cherif station. The interpretability results showed that flow (Q) and seasonality (S) are the features that most affect SSC. These outcomes indicate that the applied models can extract accurate and detailed information from the interactions between the hydroclimatic factors and the generation of sediment by erosion (output). The ML approaches demonstrated good reliability and transparency in predicting SSC in a semi-arid setting, offered new perspectives for reducing the “black box” character of ML models, and provided a useful source of information for assessing the consequences of SSC for water quality. Using SHAP and exploring other interpretability techniques are recommended to provide further information in future research. In addition, incorporating additional input data could enhance SSC predictions and deepen understanding of sediment transport dynamics.
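The three performance metrics named in the abstract can be sketched in a few lines. The following is a minimal illustration using the standard textbook definitions of RMSE, NSE, and Pearson's r; it is not the authors' code, the function names are hypothetical, and the sample values are invented purely for demonstration.

```python
import math


def rmse(obs, sim):
    """Root mean square error, in the units of the observations (g/L for SSC)."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the model
    is no better than always predicting the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def pearson_r(obs, sim):
    """Pearson correlation coefficient (dimensionless)."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    spread_obs = sum((o - mean_obs) ** 2 for o in obs)
    spread_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov / math.sqrt(spread_obs * spread_sim)


if __name__ == "__main__":
    # Hypothetical SSC values in g/L, invented for illustration only.
    observed = [0.5, 1.2, 3.4, 2.1, 0.8]
    simulated = [0.6, 1.0, 3.0, 2.4, 0.9]
    print("RMSE:", rmse(observed, simulated))
    print("NSE: ", nse(observed, simulated))
    print("r:   ", pearson_r(observed, simulated))
```

Note that RMSE carries the units of SSC (g/L), whereas NSE and r are dimensionless, which is why only the RMSE range is reported in g/L.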
Source journal: International Journal of Sediment Research
Category: Environmental Sciences
CiteScore: 6.90
Self-citation rate: 5.60%
Articles published: 88
Review time: 74 days
About the journal: International Journal of Sediment Research, the Official Journal of The International Research and Training Center on Erosion and Sedimentation and The World Association for Sedimentation and Erosion Research, publishes scientific and technical papers on all aspects of erosion and sedimentation interpreted in its widest sense. The subject matter includes not only the mechanics of sediment transport and fluvial processes, but also related topics in geography, geomorphology, soil erosion, watershed management, sedimentology, environmental and ecological impacts of sedimentation, social and economic effects of sedimentation and their assessment, etc. Special attention is paid to engineering problems related to sedimentation and erosion.