A systematic evaluation of data preprocessing and model optimization for machine learning algorithms: Using sphalerite trace element data as an example

Impact factor 2.7 · Zone 3 (Earth Sciences) · JCR Q2 (Geosciences, Multidisciplinary)
Ying-Hui Gao, Xiao-Wen Huang, Yu-Miao Meng
Journal of Asian Earth Sciences, Volume 292, Article 106728. Published 2025-07-07.
DOI: 10.1016/j.jseaes.2025.106728
https://www.sciencedirect.com/science/article/pii/S1367912025002433

Abstract

Mineral chemistry combined with machine learning (ML) algorithms has been widely used in studies of ore genesis and in mineral exploration. However, the effects of data preprocessing and parameter optimization on the performance of ML methods have not been investigated systematically. Using a real, imbalanced dataset of 4,312 sphalerite trace element data points from different deposit types, this study investigated the classification performance of eight ML algorithms, including the unsupervised methods principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), and the supervised methods linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). Different preprocessing methods were evaluated, including missing value imputation, data transformation, and feature selection. For data with skewed distributions, the study recommends k-nearest neighbors (KNN), mode, or median imputation to reduce bias. The PCA, t-SNE, LDA, PLS-DA, and SVM algorithms perform well on data transformed by the centered log-ratio (CLR) or logarithmic methods, whereas RF and XGBoost also classify well on untransformed data; XGBoost in particular performs best on data without imputation. Using recursive feature elimination (RFE) guided by feature importance, the features with the greatest discriminative power can be identified. This research not only optimizes mineral data processing workflows but also supports improving the precision and accuracy of ML models.
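
As an illustration of two of the preprocessing steps named in the abstract, the following is a minimal sketch of median imputation followed by a CLR transformation, written in plain Python. It is not the authors' code: the function names `impute_median` and `clr`, the three-element layout, and the concentration values are all hypothetical, and missing values are assumed to be encoded as `None`. The CLR step assumes every value is strictly positive.

```python
import math
from statistics import median


def impute_median(rows):
    """Replace None entries in each column with the median of that column's observed values."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [
        [medians[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def clr(row):
    """Centered log-ratio: log of each part minus the mean of the logs
    (equivalently, the log of each part divided by the row's geometric mean)."""
    logs = [math.log(x) for x in row]  # requires strictly positive values
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical concentrations (ppm) of three elements in four analyses;
# None marks a missing measurement.
samples = [
    [1200.0, 35.0, 4.0],
    [900.0, None, 6.0],
    [1500.0, 50.0, None],
    [1100.0, 40.0, 5.0],
]

filled = impute_median(samples)
transformed = [clr(r) for r in filled]
```

Each CLR-transformed row sums to zero by construction, which is a quick sanity check on the transformation. Tree-based models such as RF and XGBoost could instead be given the untransformed values, in line with the abstract's finding that they also perform well without transformation.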


Source journal
Journal of Asian Earth Sciences (Geosciences, Multidisciplinary)
CiteScore: 5.90
Self-citation rate: 10.00%
Articles published: 324
Review time: 71 days
About the journal: Journal of Asian Earth Sciences has an open access mirror journal, Journal of Asian Earth Sciences: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. The Journal of Asian Earth Sciences is an international interdisciplinary journal devoted to all aspects of research related to the solid Earth Sciences of Asia. The Journal publishes high quality, peer-reviewed scientific papers on the regional geology, tectonics, geochemistry and geophysics of Asia. It will be devoted primarily to research papers but short communications relating to new developments of broad interest, reviews and book reviews will also be included. Papers must have international appeal and should present work of more than local significance. The scope includes deep processes of the Asian continent and its adjacent oceans; seismology and earthquakes; orogeny, magmatism, metamorphism and volcanism; growth, deformation and destruction of the Asian crust; crust-mantle interaction; evolution of life (early life, biostratigraphy, biogeography and mass-extinction); fluids, fluxes and reservoirs of mineral and energy resources; surface processes (weathering, erosion, transport and deposition of sediments) and resulting geomorphology; and the response of the Earth to global climate change as viewed within the Asian continent and surrounding oceans.