Interpreting machine learning models based on SHAP values in predicting suspended sediment concentration

Impact factor 3.5 · Zone 2, Environmental Science and Ecology · JCR Q2, Environmental Sciences
Houda Lamane , Latifa Mouhir , Rachid Moussadek , Bouamar Baghdad , Ozgur Kisi , Ali El Bilali
Journal: International Journal of Sediment Research, Volume 40, Issue 1, Pages 91-107
Publication date: 2025-02-01
Publication type: Journal Article
DOI: 10.1016/j.ijsrc.2024.10.002
URL: https://www.sciencedirect.com/science/article/pii/S1001627924001070

Abstract

Machine learning (ML) has become a powerful tool for predicting suspended sediment concentration (SSC). Nonetheless, the difficulty of interpreting the underlying physical process is considered the main obstacle to applying most ML approaches. In this regard, the current study presents a novel framework involving four standalone ML models (extra trees (ET), random forest (RF), categorical boosting (CatBoost), and extreme gradient boosting (XGBoost)) and their combination with genetic programming (GP). Three metrics (coefficient of correlation (r), root mean square error (RMSE), and Nash–Sutcliffe model-fit efficiency (NSE)) and a more advanced interpretation system, SHapley Additive exPlanations (SHAP), are used to assess the performance of these models applied to hydro-climatic datasets for SSC prediction. The models were calibrated on data from 2016 to 2020 and validated on 2021 data. Further description and application of the framework are provided based on a case study of the Bouregreg watershed. The results revealed that all implemented models are efficient in SSC prediction, with NSE, RMSE, and r ranging from 0.53 to 0.86, 1.20 to 2.55 g/L, and 0.83 to 0.91, respectively. Box plot diagrams confirm the enhanced performance of the combined models. The best-performing models for the four hydrological stations are the combined RF + GP model at the Aguibat Ziar station, the combined XGBoost + GP model at the Ain Loudah station, the CatBoost model at the Ras Fathia station, and the RF model at the Sidi Med Cherif station. The interpretability results showed that flow (Q) and seasonality (S) are the features with the greatest impact on SSC. These outcomes indicate that the applied models can extract accurate and detailed information from the interactions between the hydroclimatic factors and the generation of sediment by erosion (output). The ML approaches demonstrated good reliability and transparency in predicting SSC in a semi-arid setting, offered new perspectives for reducing the "black box" character of ML models, and provided a useful source of information for assessing the consequences of SSC on water quality. Applying SHAP and exploring other interpretability techniques are recommended to provide further information in future research. In addition, incorporating additional input data could enhance SSC predictions and deepen understanding of sediment transport dynamics.
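
As a rough illustration of the three skill metrics named in the abstract, the sketch below computes r, RMSE, and NSE from their standard textbook definitions in plain Python. It is not the authors' code; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def rmse(observed, simulated):
    """Root mean square error, in the units of the data (g/L for SSC)."""
    squared = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return math.sqrt(squared / len(observed))


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the model
    is no better than always predicting the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


def pearson_r(observed, simulated):
    """Coefficient of correlation (dimensionless)."""
    mean_obs = sum(observed) / len(observed)
    mean_sim = sum(simulated) / len(simulated)
    cov = sum((o - mean_obs) * (s - mean_sim)
              for o, s in zip(observed, simulated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_sim = sum((s - mean_sim) ** 2 for s in simulated)
    return cov / math.sqrt(var_obs * var_sim)


# Hypothetical SSC values in g/L, not taken from the study.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.5, 2.0, 2.5, 4.0]

print("RMSE (g/L):", rmse(observed, simulated))
print("NSE:", nse(observed, simulated))
print("r:", pearson_r(observed, simulated))
```

Note that RMSE carries the units of SSC, while NSE and r are dimensionless. In a setup like the one described, these functions would be applied to the validation-year predictions of each model at each station.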
Source journal
International Journal of Sediment Research (Environmental Sciences)
CiteScore: 6.90
Self-citation rate: 5.60%
Articles published: 88
Review time: 74 days
Journal description: International Journal of Sediment Research, the Official Journal of The International Research and Training Center on Erosion and Sedimentation and The World Association for Sedimentation and Erosion Research, publishes scientific and technical papers on all aspects of erosion and sedimentation interpreted in its widest sense. The subject matter includes not only the mechanics of sediment transport and fluvial processes, but also topics related to geography, geomorphology, soil erosion, watershed management, sedimentology, environmental and ecological impacts of sedimentation, social and economic effects of sedimentation and its assessment, etc. Special attention is paid to engineering problems related to sedimentation and erosion.