A systematic evaluation of data preprocessing and model optimization for machine learning algorithms: Using sphalerite trace element data as an example
Authors: Ying-Hui Gao, Xiao-Wen Huang, Yu-Miao Meng
Journal: Journal of Asian Earth Sciences, volume 292, Article 106728 (Journal Article)
DOI: 10.1016/j.jseaes.2025.106728
Publication date: 2025-07-07
Impact factor: 2.7; JCR: Q2 (Geosciences, Multidisciplinary)
URL: https://www.sciencedirect.com/science/article/pii/S1367912025002433
Citations: 0
Abstract
Mineral chemistry combined with machine learning (ML) algorithms has been widely used in studies of ore genesis and mineral exploration. However, systematic investigation of how data preprocessing and parameter optimization affect the performance of ML methods has remained insufficient. Based on a real, imbalanced dataset of 4,312 trace element data for sphalerite from different types of deposits, this study investigated the classification performance of eight ML algorithms, including unsupervised principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), and supervised linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). Different preprocessing methods, including missing value imputation, data transformation, and feature selection, were evaluated. For data with skewed distributions, the study recommends k-nearest neighbors (KNN), mode, and median imputation to reduce bias. The PCA, t-SNE, LDA, PLS-DA, and SVM algorithms perform well on data transformed by the centered log-ratio (CLR) or logarithmic methods, while the RF and XGBoost algorithms also show good classification performance on untransformed data; in particular, XGBoost performs best on data without imputation. Recursive feature elimination (RFE), combined with feature importance, can identify the key features with the greatest discriminative power. This research not only optimizes mineral data processing workflows but also provides important support for improving the precision and accuracy of ML models.
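As an illustration of two of the preprocessing steps named in the abstract, the following is a minimal sketch of median imputation followed by a CLR transformation, written in plain Python. It is not the authors' pipeline: the function names `impute_median` and `clr`, the three-column layout, and the concentration values are all hypothetical, chosen only to show the mechanics. The sketch assumes strictly positive concentrations, since CLR takes logarithms.

```python
import math
from statistics import median


def impute_median(rows):
    """Replace None entries in each column with the median of that column's observed values."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        medians.append(median(observed))
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


def clr(row):
    """Centered log-ratio: log of each part minus the mean of the logs of all parts."""
    logs = [math.log(value) for value in row]  # requires strictly positive values
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Hypothetical trace element concentrations (ppm) for three elements;
# None marks a missing measurement. These are made-up numbers, not data from the study.
samples = [
    [1200.0, 35.0, 4.0],
    [900.0, None, 2.5],
    [15000.0, 50.0, None],
    [700.0, 20.0, 3.0],
]

imputed = impute_median(samples)
transformed = [clr(row) for row in imputed]

for row in transformed:
    print([round(value, 3) for value in row])
```

The median is used here because it is less sensitive than the mean to the long right tail typical of skewed concentration data, which matches the abstract's recommendation. Each CLR-transformed row sums to zero by construction, which is a quick sanity check on the transformation.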
About the journal:
Journal of Asian Earth Sciences has an open access mirror journal, Journal of Asian Earth Sciences: X, which shares the same aims and scope, editorial team, submission system and rigorous peer review.
The Journal of Asian Earth Sciences is an international interdisciplinary journal devoted to all aspects of research related to the solid Earth Sciences of Asia. The Journal publishes high quality, peer-reviewed scientific papers on the regional geology, tectonics, geochemistry and geophysics of Asia. It will be devoted primarily to research papers but short communications relating to new developments of broad interest, reviews and book reviews will also be included. Papers must have international appeal and should present work of more than local significance.
The scope includes deep processes of the Asian continent and its adjacent oceans; seismology and earthquakes; orogeny, magmatism, metamorphism and volcanism; growth, deformation and destruction of the Asian crust; crust-mantle interaction; evolution of life (early life, biostratigraphy, biogeography and mass-extinction); fluids, fluxes and reservoirs of mineral and energy resources; surface processes (weathering, erosion, transport and deposition of sediments) and resulting geomorphology; and the response of the Earth to global climate change as viewed within the Asian continent and surrounding oceans.