Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations.

IF 3.9 Zone 3 (Environmental Science and Ecology) Q2 ENVIRONMENTAL SCIENCES

Toxics Pub Date : 2025-07-10 DOI:10.3390/toxics13070579

Yuan Zhang, Yanting Li, Yang Li, Lin Zhao, Yongkui Yang

Citations: 0

Abstract

Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges remain when using ML to predict migration in hydroponic systems: small sample sizes and a lack of quantitative calculation functions. For the bioaccumulation of per- and polyfluoroalkyl substances (PFASs), we studied the key factors and derived quantitative calculation equations using data augmentation, ML, and symbolic regression. First, feature expansion was performed on the input data after data preprocessing; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique (SMOTE) and a variational autoencoder (VAE). Subsequently, four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). The Shapley Additive Explanations (SHAP) values indicated that molecular weight and exposure time were the most important parameters. We then applied three symbolic regression models to obtain prediction equations from both the original and the augmented data. On the augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into absorption and accumulation in plant roots.
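The augmentation step described in the abstract can be illustrated with a small sketch. This is not the authors' code: the paper combines SMOTE with a variational autoencoder, whereas the example below shows only a SMOTE-style interpolation between nearest neighbours, written in plain Python. The function names (`smote_like_augment`, `r_squared`), the choice to interpolate the target value along with the features, and the toy rows are all assumptions made for illustration. Generating eight synthetic rows per original row is one possible reading of "expanded nine times".

```python
import math
import random


def smote_like_augment(rows, n_new_per_row=8, k=3, seed=0):
    """Return the original rows plus SMOTE-style synthetic rows.

    rows: list of tuples (feature_1, ..., feature_n, target).
    Each synthetic row lies on the segment between a row and one of its
    k nearest neighbours (distance measured on features only).
    Interpolating the target as well is an assumption of this sketch.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for i, row in enumerate(rows):
        # Rank all other rows by Euclidean distance in feature space.
        others = [r for j, r in enumerate(rows) if j != i]
        others.sort(key=lambda r: math.dist(row[:-1], r[:-1]))
        neighbours = others[:k]
        for _ in range(n_new_per_row):
            nb = rng.choice(neighbours)
            lam = rng.random()  # interpolation weight in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(row, nb)))
    return list(rows) + synthetic


def r_squared(y_true, y_pred):
    """Coefficient of determination, the metric reported in the abstract."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy rows (made-up numbers, not data from the paper):
# (molecular weight, exposure time, target value)
toy_rows = [
    (214.0, 1.0, 0.2),
    (314.0, 3.0, 0.5),
    (414.0, 7.0, 0.9),
    (500.0, 14.0, 1.4),
]
augmented = smote_like_augment(toy_rows, n_new_per_row=8, k=2, seed=1)
print(len(toy_rows), "->", len(augmented))  # 4 -> 36
```

Only the training split should be augmented in a workflow like this, so that the test-set R² is computed on measured rather than synthetic samples, which matches the abstract's statement that the training set was expanded and the models were evaluated on the test set.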

Source journal

Toxics Chemical Engineering-Chemical Health and Safety

CiteScore

4.50

Self-citation rate

10.90%

Articles published

681

Review time

6 weeks

Journal description: Toxics (ISSN 2305-6304) is an international, peer-reviewed, open access journal which provides an advanced forum for studies related to all aspects of toxic chemicals and materials. It publishes reviews, regular research papers, and short communications. Our aim is to encourage scientists to publish their experimental and theoretical results in detail. There is, therefore, no restriction on the maximum length of the papers, although authors should write their papers in a clear and concise way. The full experimental details must be provided so that the results can be reproduced. Electronic files or software regarding the full details of calculations and experimental procedure can be deposited as supplementary material, if it is not possible to publish them along with the text.