Yuan Liu, Ruyu Fu, Ying Xue, Mengjia Li, Xuedong Wang
A QSAR-machine learning hybrid model for predicting the ecotoxicity of soil organic compounds and deriving thresholds
Journal of Cleaner Production, Volume 549, Article 147869. Published 2026-03-10 (Epub 2026-02-26).
DOI: 10.1016/j.jclepro.2026.147869
https://www.sciencedirect.com/science/article/pii/S0959652626004087
Abstract
Soil organic pollution threatens the integrity of ecosystems. Traditional ecotoxicity assessments suffer from data scarcity and are limited by linear models. This study developed a machine learning-quantitative structure-activity relationship (ML-QSAR) model that integrated 2108 toxicity data points (77 species, 305 compounds) and incorporated molecular descriptors derived from density functional theory (DFT). Ecological thresholds were derived via species sensitivity distribution (SSD). The Random Forest (RF) algorithm outperformed XGBoost and CatBoost, with a training/test R² of 0.968/0.824. In external validation, 95.9% of predictions fell within a 1.5-fold error. Global feature analysis identified entropy, dipole moment (μ), and soil moisture as the core driving features. Entropy regulated toxicity through a threshold effect at 744.5 J/(mol·K), with toxicity increasing 2.3-fold in the low-entropy range. Dipole moment and soil moisture interacted significantly: toxicity increased 2.3-fold when μ > 4.4 Debye and soil moisture > 31.7%. Toxicity was also modulated by the interaction of soil silt content with 22 parameters. The goodness-of-fit of the SSD curves constructed from model predictions exceeded 0.91. The derived ecological safety threshold (predicted no-effect concentration, PNEC) for dinitrotoluene was 5.498 mg/kg, far lower than those for anthracene oil, hexabromocyclododecane, and perfluorooctanoic acid; dinitrotoluene is therefore considered the highest-risk pollutant. The framework overcomes the linear limitations of traditional QSAR models and provides a high-throughput tool for soil contaminant risk screening.
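The abstract reports two quantities that are easy to misread: the share of predictions within a 1.5-fold error, and a threshold read off an SSD curve. The sketch below shows one common way to compute each; it is not the authors' code. It assumes the fold error is the symmetric ratio between observed and predicted values on the original (not log) scale, and that the SSD is a log-normal distribution whose 5th percentile (HC5) is the starting point for a threshold. The abstract states neither choice, nor whether an assessment factor is applied to obtain the PNEC, so the example stops at HC5. The function names and sample numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def fraction_within_fold(observed, predicted, fold=1.5):
    """Share of predictions whose ratio to the observation is within `fold`.

    Assumes strictly positive values on the original concentration scale.
    """
    hits = 0
    for obs, pred in zip(observed, predicted):
        ratio = max(obs / pred, pred / obs)  # symmetric fold error, always >= 1
        if ratio <= fold:
            hits += 1
    return hits / len(observed)


def hc5_lognormal(toxicity_values, fraction=0.05):
    """Concentration hazardous to `fraction` of species.

    Assumes a log-normal SSD fitted by the mean and sample standard
    deviation of the log10-transformed species toxicity values.
    """
    logs = [math.log10(v) for v in toxicity_values]
    ssd = NormalDist(mean(logs), stdev(logs))
    return 10 ** ssd.inv_cdf(fraction)


if __name__ == "__main__":
    # Hypothetical observed/predicted toxicity values (mg/kg), not from the paper.
    observed = [10.0, 10.0, 10.0, 10.0]
    predicted = [12.0, 15.0, 16.0, 5.0]
    print(fraction_within_fold(observed, predicted))

    # Hypothetical per-species toxicity values (mg/kg) for one compound.
    print(hc5_lognormal([1.0, 10.0, 100.0]))
```

A lower HC5 means that a smaller soil concentration already affects the most sensitive 5% of species, which is why the compound with the lowest derived threshold is ranked as the highest risk.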
About the journal:
The Journal of Cleaner Production is an international, transdisciplinary journal that addresses and discusses theoretical and practical Cleaner Production, Environmental, and Sustainability issues. It aims to help societies become more sustainable by focusing on the concept of 'Cleaner Production', which aims at preventing waste production and increasing efficiencies in energy, water, resources, and human capital use. The journal serves as a platform for corporations, governments, education institutions, regions, and societies to engage in discussions and research related to Cleaner Production, environmental, and sustainability practices.