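Letter to the Editor regarding “Prediction of PFAS bioaccumulation in different plant tissues with machine learning models based on molecular fingerprints” by Song et al. (2024), Sci. Total Environ. 950 175091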

IF 8.2, Zone 1 (Environmental Science and Ecology), JCR Q1 (ENVIRONMENTAL SCIENCES)

Science of the Total Environment. Pub Date: 2025-05-23. DOI: 10.1016/j.scitotenv.2025.179714

Souichi Oka, Yoshiyasu Takefuji

Science of the Total Environment, volume 984, Article 179714. Journal article; not open access. Source: https://www.sciencedirect.com/science/article/pii/S0048969725013555

Citations: 0

Abstract

Song et al. (2024), “Prediction of PFAS bioaccumulation in different plant tissues with machine learning models based on molecular fingerprints,” employed machine learning methods, such as XGBoost and SHapley Additive exPlanations (SHAP), to predict PFAS bioaccumulation, reporting high predictive accuracy. However, this commentary critically examines their interpretation of feature importance, since high predictive accuracy does not guarantee reliable feature importance. Both XGBoost and SHAP are known to exhibit biases, such as overemphasizing features used in early splits and inheriting biases from the underlying model. Furthermore, the high dimensionality and potential collinearity of molecular fingerprints complicate SHAP interpretation, increasing overfitting risk and compromising SHAP value stability. To provide a general example, we conducted an independent simulation using a publicly available dataset of US industrial facilities and environmental compliance, demonstrating significant discrepancies between feature importance rankings from XGBoost and robust statistical tests. This commentary advocates for robust statistical methods coupled with p-values, including Spearman's rho, Kendall's tau, Goodman-Kruskal's gamma, Somers' delta, and Hoeffding's dependence, for feature selection. These non-parametric methods, which are independent of specific model assumptions and rely on data ranks, are better suited to capture complex relationships in high-dimensional data, providing a more reliable foundation for future PFAS bioaccumulation research.
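The rank-based measures named in the abstract can be illustrated with a small sketch. The code below is not the authors' analysis; it is a minimal, standard-library-only illustration of three of the listed statistics (Kendall's tau-b, Goodman-Kruskal's gamma, and Somers' D), all of which are built from counts of concordant and discordant pairs. Spearman's rho and Hoeffding's dependence measure are left out for brevity. The permutation test used for the p-value, the function names, and the toy data are assumptions made for this example only.

```python
import math
import random
from itertools import combinations


def pair_counts(x, y):
    """Return (concordant, discordant, tied on x only, tied on y only)."""
    conc = disc = tie_x = tie_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables; none of the measures use these
        if dx == 0:
            tie_x += 1
        elif dy == 0:
            tie_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return conc, disc, tie_x, tie_y


def _ratio(num, den):
    # Convention chosen for this sketch: report 0.0 when a measure is undefined.
    return num / den if den else 0.0


def rank_association(x, y):
    """Kendall's tau-b, Goodman-Kruskal's gamma, and Somers' D(y|x)."""
    c, d, tx, ty = pair_counts(x, y)
    return {
        "kendall_tau_b": _ratio(c - d, math.sqrt((c + d + tx) * (c + d + ty))),
        "goodman_kruskal_gamma": _ratio(c - d, c + d),
        "somers_d_yx": _ratio(c - d, c + d + ty),
    }


def permutation_p_value(x, y, measure="kendall_tau_b", n_perm=999, seed=0):
    """Two-sided permutation p-value (an assumed choice of test, see lead-in)."""
    rng = random.Random(seed)
    observed = abs(rank_association(x, y)[measure])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(rank_association(x, shuffled)[measure]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic toy data (hypothetical, not from either paper):
# one feature monotone in the target, one unrelated to it.
target = [0.1, 0.4, 0.5, 0.9, 1.3, 1.8, 2.0, 2.6, 3.1, 3.3]
features = {
    "monotone_feature": [t ** 3 for t in target],
    "unrelated_feature": [5, 1, 4, 2, 5, 3, 1, 4, 2, 3],
}
for name, values in features.items():
    stats = rank_association(values, target)
    p = permutation_p_value(values, target)
    print(name, {k: round(v, 3) for k, v in stats.items()}, "p =", round(p, 3))
```

Because the statistics depend only on the ordering of values, the cubic transform of the toy feature does not weaken its measured association with the target. For real datasets, library routines such as `scipy.stats.spearmanr` and `scipy.stats.kendalltau` would likely be a better choice than this quadratic-time pair loop.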

Source journal

Science of the Total Environment (Environmental Sciences)

CiteScore: 17.60
Self-citation rate: 10.20%
Articles published: 8726
Review time: 2.4 months

About the journal: The Science of the Total Environment is an international journal dedicated to scientific research on the environment and its interaction with humanity. It covers a wide range of disciplines and seeks to publish innovative, hypothesis-driven, and impactful research that explores the entire environment, including the atmosphere, lithosphere, hydrosphere, biosphere, and anthroposphere. The journal's updated Aims & Scope emphasizes the importance of interdisciplinary environmental research with broad impact. Priority is given to studies that advance fundamental understanding and explore the interconnectedness of multiple environmental spheres. Field studies are preferred, while laboratory experiments must demonstrate significant methodological advancements or mechanistic insights with direct relevance to the environment.