A systematic evaluation of data preprocessing and model optimization for machine learning algorithms: Using sphalerite trace element data as an example

IF 2.7, Zone 3 (Earth Sciences), Q2 GEOSCIENCES, MULTIDISCIPLINARY

Journal of Asian Earth Sciences Pub Date : 2025-07-07 DOI:10.1016/j.jseaes.2025.106728

Ying-Hui Gao , Xiao-Wen Huang , Yu-Miao Meng

Journal of Asian Earth Sciences, Volume 292, Article 106728. Journal Article. https://www.sciencedirect.com/science/article/pii/S1367912025002433

Citations: 0

Abstract

Mineral chemistry combined with machine learning (ML) algorithms has been widely used in studies of ore genesis and mineral exploration. However, the effects of data preprocessing and optimal parameter selection on the performance of ML methods have not been investigated systematically. Based on a real, imbalanced dataset of 4,312 sphalerite trace element data points from different types of deposits, this study investigated the classification performance of eight ML algorithms, including unsupervised principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), and supervised linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). Different preprocessing methods, including missing value imputation, data transformation, and feature selection, were evaluated. For data with skewed distributions, the study recommends k-nearest neighbors (KNN), mode, or median imputation to reduce bias. The PCA, t-SNE, LDA, PLS-DA, and SVM algorithms perform well on data transformed by the centered log-ratio (CLR) or logarithmic methods, whereas the RF and XGBoost algorithms also classify well on untransformed data; XGBoost in particular performs best on data without imputation. Recursive feature elimination (RFE) combined with feature importance can identify the key features with the greatest discriminative power. This research not only optimizes mineral data processing workflows but also provides important support for improving the precision and accuracy of ML models.
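The preprocessing chain described in the abstract (imputation, transformation, then feature selection) can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic log-normal values standing in for trace element concentrations, the helper names (`clr_transform`, `make_synthetic_data`, `select_features`) are hypothetical, and the choices of median imputation, 200 trees, and two retained features are assumptions made only for illustration. It uses NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer


def clr_transform(x):
    """Centered log-ratio transform of strictly positive rows."""
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)


def make_synthetic_data(n_per_class=150, n_features=6, seed=0):
    """Hypothetical stand-in for a trace element table (not the study's data)."""
    rng = np.random.default_rng(seed)
    # Log-normal values mimic right-skewed trace element distributions.
    class_a = rng.lognormal(mean=0.0, sigma=1.0, size=(n_per_class, n_features))
    class_b = rng.lognormal(mean=0.0, sigma=1.0, size=(n_per_class, n_features))
    class_b[:, 0] *= 20.0  # feature 0 is enriched in class 1
    class_b[:, 1] *= 0.05  # feature 1 is depleted in class 1
    x = np.vstack([class_a, class_b])
    y = np.repeat([0, 1], n_per_class)
    # Blank out about 10% of entries to imitate missing measurements.
    x[rng.random(x.shape) < 0.10] = np.nan
    return x, y


def select_features(x, y, n_keep=2, seed=0):
    """Median imputation, then CLR, then RFE ranked by random forest importances."""
    x_imputed = SimpleImputer(strategy="median").fit_transform(x)
    x_clr = clr_transform(x_imputed)
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    selector = RFE(forest, n_features_to_select=n_keep, step=1).fit(x_clr, y)
    return x_clr, selector.support_


x_raw, labels = make_synthetic_data()
x_clr, kept = select_features(x_raw, labels)
print("selected feature indices:", np.flatnonzero(kept))
```

In this sketch RFE repeatedly refits the forest and drops the feature with the lowest importance until `n_keep` remain. Swapping `SimpleImputer` for scikit-learn's `KNNImputer` would give the KNN variant mentioned in the abstract; which imputer suits a real dataset is a question the sketch does not answer.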


Source journal

Journal of Asian Earth Sciences (Earth Sciences, Multidisciplinary)

CiteScore : 5.90

Self-citation rate : 10.00%

Articles published : 324

Review time : 71 days

About the journal: Journal of Asian Earth Sciences has an open access mirror journal, Journal of Asian Earth Sciences: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. The Journal of Asian Earth Sciences is an international interdisciplinary journal devoted to all aspects of research related to the solid Earth Sciences of Asia. The Journal publishes high-quality, peer-reviewed scientific papers on the regional geology, tectonics, geochemistry and geophysics of Asia. It is devoted primarily to research papers, but short communications relating to new developments of broad interest, reviews and book reviews are also included. Papers must have international appeal and should present work of more than local significance. The scope includes deep processes of the Asian continent and its adjacent oceans; seismology and earthquakes; orogeny, magmatism, metamorphism and volcanism; growth, deformation and destruction of the Asian crust; crust-mantle interaction; evolution of life (early life, biostratigraphy, biogeography and mass-extinction); fluids, fluxes and reservoirs of mineral and energy resources; surface processes (weathering, erosion, transport and deposition of sediments) and resulting geomorphology; and the response of the Earth to global climate change as viewed within the Asian continent and surrounding oceans.