Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations.

IF 3.9, Zone 3 (Environmental Science and Ecology), Q2 ENVIRONMENTAL SCIENCES
Toxics Pub Date : 2025-07-10 DOI:10.3390/toxics13070579
Yuan Zhang, Yanting Li, Yang Li, Lin Zhao, Yongkui Yang

Abstract

Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges (small sample sizes and a lack of quantitative calculation functions) remain when using ML to predict migration in hydroponic systems. For the bioaccumulation of per- and polyfluoroalkyl substances, we studied the key factors and quantitative calculation equations based on data augmentation, ML, and symbolic regression. First, feature expansion was performed on the input data after data preprocessing; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique (SMOTE) and a variational autoencoder (VAE). Subsequently, four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). The Shapley Additive Explanations (SHAP) values indicated that molecular weight and exposure time were the most important parameters. We applied three symbolic regression models to obtain accurate prediction equations based on the original and augmented data. Based on augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into absorption and accumulation in plant roots.
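
The augmentation step and the reported accuracy metric can be illustrated with a small sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the function names `smote_like_augment` and `r_squared`, the choice of k, the decision to interpolate the target value together with the features, the reading of "expanded nine times" as a ninefold total size, and the toy numbers. Only the interpolation idea behind the synthetic minority oversampling technique is shown; the variational autoencoder part of the authors' method is not reproduced here.

```python
import math
import random


def smote_like_augment(rows, factor=9, k=3, seed=0):
    """Return the original rows plus synthetic rows made by interpolation.

    Each row is a tuple of floats: the features followed by the target.
    `factor` is the output size relative to the input, so factor=9
    returns 9 * len(rows) rows (an assumed reading of "nine times").
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(rows) - 1)
    synthetic = []
    for _ in range((factor - 1) * len(rows)):
        i = rng.randrange(len(rows))
        base = rows[i]
        # Rank the other rows by Euclidean distance in feature space only.
        others = [r for j, r in enumerate(rows) if j != i]
        others.sort(key=lambda r: math.dist(r[:-1], base[:-1]))
        neighbour = rng.choice(others[:k])
        # Place the new row on the segment between base and neighbour.
        # Assumption: the target is interpolated with the same weight.
        t = rng.random()
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return list(rows) + synthetic


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up rows: (molecular_weight, exposure_time_days, target). Not real data.
toy = [
    (314.0, 1.0, 0.2),
    (414.0, 3.0, 0.5),
    (500.0, 7.0, 0.9),
    (364.0, 5.0, 0.4),
]
augmented = smote_like_augment(toy, factor=9)
print(len(augmented))  # 36
```

Because every synthetic row is a convex combination of two real rows, the augmented set never leaves the range of the original measurements. For that reason such augmentation is typically applied to the training set only, as the abstract describes, with R² then computed on an untouched test set.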

Source journal
Toxics (Chemical Engineering: Chemical Health and Safety)
CiteScore: 4.50
Self-citation rate: 10.90%
Articles published: 681
Review time: 6 weeks
About the journal: Toxics (ISSN 2305-6304) is an international, peer-reviewed, open access journal which provides an advanced forum for studies related to all aspects of toxic chemicals and materials. It publishes reviews, regular research papers, and short communications. Our aim is to encourage scientists to publish their experimental and theoretical results in detail. There is, therefore, no restriction on the maximum length of the papers, although authors should write their papers in a clear and concise way. The full experimental details must be provided so that the results can be reproduced. Electronic files or software regarding the full details of calculations and experimental procedure can be deposited as supplementary material, if it is not possible to publish them along with the text.