Quantitative prediction of soil AS content based on variational auto-encoder generated samples coupled with machine learning

IF 3.8 | CAS Zone 2 (Chemistry) | JCR Q2, AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems | Pub Date: 2025-07-16 | DOI: 10.1016/j.chemolab.2025.105486

Chengbiao Fu, Qingyuan Zhuang, Anhong Tian

Chemometrics and Intelligent Laboratory Systems, volume 265, Article 105486. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0169743925001716

Citations: 0

Abstract

This study aims to improve the prediction accuracy of soil arsenic (As) content, which is currently constrained by limited sample data. To address this limitation, we propose a method that uses a variational auto-encoder (VAE) to generate additional samples that augment the original training dataset. The approach was validated on contaminated farmland soil samples collected from Yunnan Province. The hyperspectral data were preprocessed with Savitzky-Golay (SG) smoothing and the Standard Normal Variate (SNV) transformation, and feature bands were extracted with the Successive Projections Algorithm (SPA). Four machine learning models (PLSR, SVR, RBF, GBM) were used to build prediction models for soil As content. Predictive ability was evaluated with three indices: the coefficient of determination (R²), the root mean square error (RMSE), and the ratio of performance to deviation (RPD). The results show that after the real training dataset was augmented with VAE-generated samples, the predictive capability of all four models improved to varying degrees, and overfitting was effectively alleviated. The RPD of the PLSR model improved from 1.682 to 2.226 with the generated samples, while the RPD values of the other three models (SVR, RBF, GBM) rose above 3.000. Notably, the GBM model showed the largest improvement, with its RPD increasing from 1.566 to 3.326. The number of generated samples also affects prediction accuracy: too few generated samples leave accuracy unsatisfactory, whereas too many degrade prediction performance. At the 16,000th iteration of VAE training, the generated samples are highly similar to the real training dataset. The average structural similarity index measure and average peak signal-to-noise ratio are 0.972 and 20.558 dB, respectively, and the Pearson correlation coefficient is 0.861, indicating a strong correlation between generated and real samples. After the training dataset was augmented, the best-performing model was SVR, with a validation-set R², RMSE, and RPD of 0.923, 72.187 mg kg⁻¹, and 3.611, respectively. The number of extracted feature bands was 25, and the augmentation added 5 samples. The model with the largest improvement in predictive performance was GBM, whose validation-set R² increased by 0.318, RMSE decreased by 88.044 mg kg⁻¹, and RPD increased by 1.760. This study shows that VAE-based data augmentation can effectively improve the feasibility of machine learning algorithms for predicting soil arsenic content, and it offers a new approach to improving model prediction performance without increasing sampling costs.
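
The three evaluation indices named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the definitions below are the commonly used ones and should be read as assumptions: R² = 1 - SSE/SST, RMSE = sqrt(SSE/n), and RPD = SD/RMSE, with SD taken as the population standard deviation of the measured values (some authors use the sample standard deviation, which gives a slightly larger RPD). The function name `regression_metrics` and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) for measured vs. predicted values.

    Assumed definitions (not stated in the abstract):
      R2   = 1 - SSE / SST
      RMSE = sqrt(SSE / n)
      RPD  = SD / RMSE, with SD the population standard deviation of y_true
    """
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    mean_true = sum(y_true) / n
    # Sum of squared prediction errors.
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares of the measured values around their mean.
    sst = sum((t - mean_true) ** 2 for t in y_true)
    if sst == 0:
        raise ValueError("measured values are constant; R2 and RPD are undefined")

    r2 = 1.0 - sse / sst
    rmse = math.sqrt(sse / n)
    sd = math.sqrt(sst / n)  # population SD (assumption; see lead-in)
    rpd = math.inf if rmse == 0 else sd / rmse
    return r2, rmse, rpd


# Hypothetical measured and predicted As contents (mg/kg), for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
r2, rmse, rpd = regression_metrics(measured, predicted)
print(f"R2={r2:.3f} RMSE={rmse:.3f} RPD={rpd:.3f}")
```

Under these assumed definitions, RPD equals 1/sqrt(1 - R²), so an R² of 0.923 corresponds to an RPD of about 3.60, close to the 3.611 reported for the SVR validation set.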




Source journal

Chemometrics and Intelligent Laboratory Systems (Engineering & Technology: Analytical Chemistry)

CiteScore: 7.50

Self-citation rate: 7.70%

Articles published: 169

Review time: 3.4 months

About the journal: Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics: 1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.). 2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered. 3) Development of new software that provides novel tools or truly advances the use of chemometrical methods. 4) Well characterized data sets to test performance for the new methods and software. The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for manuscripts.