Combining categorical boosting and Shapley additive explanations for building an interpretable ensemble classifier for identifying mineralization-related geochemical anomalies

IF 3.2, Zone 2 (Earth Sciences), JCR Q1 (Geology)
Yongliang Chen, Bowen Chen, Alina Shayilan
Ore Geology Reviews, Volume 173, Article 106263. Published 2024-10-01. DOI: 10.1016/j.oregeorev.2024.106263
https://www.sciencedirect.com/science/article/pii/S0169136824003962

Abstract

The vast majority of shallow and deep learning techniques used to identify mineralization-related geochemical anomalies are black-box algorithms that cannot elucidate the individual contribution of each element to the model predictions. In addition, most of the anomaly identification models established by shallow and deep learning algorithms lack robustness. Establishing interpretable and robust machine learning models is therefore a challenge in applying machine learning techniques to geochemical anomaly identification. To this end, the categorical boosting (CatBoost) algorithm was employed to build a robust ensemble classifier to identify mineralization-related anomalies from the 1:50,000 geochemical reconnaissance data (stream sediment survey) in the Yeniugou area of Xinjiang (China). The receiver operating characteristic (ROC) curve and the precision-recall (P-R) curve of the ensemble model were plotted, and the area under the ROC curve (AUC) and the area under the P-R curve (AUPRC) were calculated to measure its performance. The ROC curve of the ensemble model approximates that of the perfect classification model, and its P-R curve is close to the upper right corner of the P-R space. The AUC and AUPRC values of the ensemble model reach 0.9981 and 0.7816, respectively. The identified polymetallic mineralization-related geochemical anomalies account for 3% of the whole exploration area while correctly identifying all known polymetallic deposits.

To enhance the interpretability of the CatBoost model, the Shapley additive explanations (SHAP) tool was adopted to graphically interpret the predictions of the ensemble model. The graphic interpretation shows that the importance order of the 14 elements is Ni-Au-Ag-Sn-As-Cr-Zn-Cu-Pb-Sb-W-Bi-Mo-Co. Cu and Ni are most likely the metallogenic elements of the study area. Cu interacts with Ni, Ag, As, Sn, Cr, Zn, Pb, Sb, W, Bi, and Co, and Ni interacts with Au, Sn, As, Zn, Cu, W, Bi, and Co. Two polymetallic prospective areas were delineated in the study area: one is a Cu-Ni polymetallic mineralization prospective area, and the other is a Ni polymetallic mineralization prospective area. It can be concluded that the combination of CatBoost and SHAP is an effective way to construct an interpretable ensemble model with high performance and robustness for identifying mineralization-related geochemical anomalies.
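The quantities named in the abstract (AUC, AUPRC, and Shapley-based element contributions) can be illustrated with a small, standard-library-only Python sketch. It is not the authors' pipeline: it uses neither CatBoost nor the SHAP package, and the labels, scores, and the toy scoring function `toy_value` (including its coefficients and its Ni-Cu interaction term) are hypothetical values chosen only for illustration. The Shapley routine enumerates all coalitions by brute force, which is feasible for a handful of features but is not the tree-specific algorithm a SHAP tool would typically use for 14 elements.

```python
from itertools import combinations
from math import factorial


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the P-R curve; assumes no tied scores and at least one positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


def shapley_values(value, features):
    """Exact Shapley values of the set function `value` over `features` (brute force)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        contrib = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                contrib += weight * (value(s | {f}) - value(s))
        phi[f] = contrib
    return phi


def toy_value(subset):
    """Hypothetical anomaly score when only the elements in `subset` are 'switched on'."""
    ni, cu, au = "Ni" in subset, "Cu" in subset, "Au" in subset
    return 2.0 * ni + 1.0 * cu + 1.0 * (ni and cu) + 0.5 * au


# Hypothetical labels (1 = mineralized) and classifier scores for eight samples.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.10, 0.20, 0.90, 0.30, 0.32, 0.35, 0.05, 0.80]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
phi = shapley_values(toy_value, ["Ni", "Cu", "Au"])

print(f"AUC={auc:.4f}  AUPRC={ap:.4f}")
print({name: round(v, 4) for name, v in phi.items()})
```

In this toy setting the Ni-Cu interaction term is split evenly between Ni and Cu, and the three Shapley values sum to the difference between the score with all elements present and the score with none. That additivity property is what allows per-element contributions to be read as a decomposition of a single prediction.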
Source journal
Ore Geology Reviews (Earth Sciences: Geology)
CiteScore: 6.50
Self-citation rate: 27.30%
Articles published: 546
Review time: 22.9 weeks
Journal description: Ore Geology Reviews aims to familiarize all earth scientists with recent advances in a number of interconnected disciplines related to the study of, and search for, ore deposits. The reviews range from brief to longer contributions, but the journal preferentially publishes manuscripts that fill the niche between the commonly shorter journal articles and the comprehensive book coverages, and thus has a special appeal to many authors and readers.