Title: Quantitative prediction of soil AS content based on variational auto-encoder generated samples coupled with machine learning
Authors: Chengbiao Fu, Qingyuan Zhuang, Anhong Tian
Journal: Chemometrics and Intelligent Laboratory Systems, volume 265, Article 105486
Publication date: 2025-07-16
Publication type: Journal Article
DOI: 10.1016/j.chemolab.2025.105486
URL: https://www.sciencedirect.com/science/article/pii/S0169743925001716
Impact factor: 3.8 (JCR Q2, Automation & Control Systems)
Citations: 0
Abstract
This study aims to improve the prediction accuracy of soil arsenic (As) content, which is currently constrained by limited sample data. To address this limitation, we propose a method that employs a variational auto-encoder (VAE) to generate additional samples for augmenting the original training dataset. The approach was validated on contaminated farmland soil samples collected from Yunnan Province. We applied Savitzky-Golay (SG) smoothing and Standard Normal Variate (SNV) transformation to preprocess the hyperspectral data, and feature bands were extracted with the Successive Projections Algorithm (SPA). Four machine learning models (PLSR, SVR, RBF, GBM) were used to build prediction models for soil As content. Predictive ability was evaluated with three indices: the coefficient of determination (R²), the root mean square error (RMSE), and the ratio of performance to deviation (RPD). The results show that after augmenting the real training dataset with VAE-generated samples, the predictive capabilities of all four models improved to varying degrees, and the models' overfitting problems were effectively alleviated. The RPD of the PLSR model improved from 1.682 to 2.226 with the generated samples, while the RPD values of the remaining three models (SVR, RBF, GBM) rose above 3.000. Notably, the GBM model showed the largest improvement, with its RPD increasing from 1.566 to 3.326. The number of generated samples also affects prediction accuracy: too few leave the accuracy unsatisfactory, whereas too many degrade prediction performance. At the 16,000th iteration of the VAE network, the generated samples are highly similar to the real training dataset. The average structural similarity index measure and average peak signal-to-noise ratio are 0.972 and 20.558 dB, respectively, and the Pearson correlation coefficient is 0.861, indicating a strong correlation between generated and real samples. After the training dataset was expanded, the model with the best prediction performance was SVR, whose validation-set R², RMSE, and RPD were 0.923, 72.187 mg kg⁻¹, and 3.611, respectively. The number of extracted feature bands was 25, and the expansion added 5 samples. The model with the largest improvement in predictive performance was GBM, whose validation-set R² increased by 0.318, RMSE decreased by 88.044 mg kg⁻¹, and RPD increased by 1.760. This study shows that VAE-based data augmentation can effectively improve the feasibility of machine learning algorithms for predicting soil arsenic content, and it offers a new approach to improving model prediction performance without increasing sampling costs.
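The three evaluation indices named in the abstract can be computed as in the minimal Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name `regression_metrics` is hypothetical, the example values are made up, and RPD is taken as the sample standard deviation (n − 1 denominator) of the reference values divided by RMSE, which is one common convention that the abstract does not confirm.

```python
import math
from statistics import mean, stdev


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and RPD for paired reference/predicted values.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")

    y_bar = mean(y_true)
    # Residual and total sums of squares
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)

    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(y_true))
    # Assumption: RPD uses the sample standard deviation (n - 1 denominator);
    # some authors use the population standard deviation instead.
    rpd = stdev(y_true) / rmse
    return {"r2": r2, "rmse": rmse, "rpd": rpd}


if __name__ == "__main__":
    # Made-up concentrations (mg/kg), not data from the study
    reference = [10.0, 20.0, 30.0, 40.0, 50.0]
    predicted = [12.0, 18.0, 33.0, 37.0, 52.0]
    print(regression_metrics(reference, predicted))
```

Under these definitions RPD and R² are tied together by RPD = sqrt(n / (n − 1)) / sqrt(1 − R²), so for a large validation set RPD is roughly 1 / sqrt(1 − R²). For the reported SVR result, R² = 0.923 gives about 3.60, which is close to the reported RPD of 3.611.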
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.).
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well-characterized data sets for testing the performance of new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.