Machine learning-based prediction of fish acute mortality: implementation, interpretation, and regulatory relevance†

Impact factor: 4.4; JCR quartile: Q3 (Engineering, Environmental)

Environmental science. Advances. Pub Date: 2024-06-03. DOI: 10.1039/D4VA00072B

Lilian Gasser, Christoph Schür, Fernando Perez-Cruz, Kristin Schirmer and Marco Baity-Jesi

Environmental science. Advances, volume 8, pages 1124-1138. Article page: https://pubs.rsc.org/en/content/articlelanding/2024/va/d4va00072b (PDF: https://pubs.rsc.org/en/content/articlepdf/2024/va/d4va00072b?page=search)


Abstract

Regulating chemicals requires knowledge of their toxicological effects on a large number of species, which has traditionally been acquired through in vivo testing. Recent efforts to find alternatives based on machine learning, however, have not focused on guaranteeing transparency, comparability, and reproducibility, which makes it difficult to assess the advantages and disadvantages of these methods. Comparable baseline performances are also needed. In this study, we trained regression models on the ADORE “t-F2F” challenge proposed in [Schür et al., Nature Scientific data, 2023] to predict the acute mortality of fish exposed to organic compounds, measured as LC50 (lethal concentration 50). We trained LASSO, random forest (RF), XGBoost, and Gaussian process (GP) regression models and found several aspects that are stable across models: (i) using mass or molar concentrations does not affect performance; (ii) performance depends only weakly on the molecular representation of the chemicals, but (iii) strongly on how the data is split. Overall, the tree-based models RF and XGBoost performed best, and we were able to predict the log10-transformed LC50 with a root mean square error of 0.90, which corresponds to an order of magnitude on the original LC50 scale. At the local level, however, the models cannot consistently predict the toxicity of individual chemicals accurately enough. Predictions for single chemicals are mostly influenced by a few chemical properties, while taxonomic traits are not sufficiently captured by the models. We discuss technical and conceptual improvements that address these challenges and enhance the suitability of in silico methods for environmental hazard assessment. Accordingly, this work showcases state-of-the-art models and contributes to the ongoing discussion on regulatory integration.
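
To make the error scale in the abstract concrete, the following minimal Python sketch shows how a root mean square error on log10-transformed LC50 values translates into a multiplicative factor on the original concentration scale, and how a mass concentration relates to a molar one. It is an illustration only, not the authors' code: the function names, the units (mg/L, g/mol), and the toy values in the usage lines are assumptions made for this example.

```python
import math


def rmse_log10(lc50_true, lc50_pred):
    """RMSE between log10-transformed LC50 values (same units for both lists)."""
    squared_errors = [
        (math.log10(t) - math.log10(p)) ** 2
        for t, p in zip(lc50_true, lc50_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fold_error(rmse):
    """Multiplicative factor on the original scale implied by a log10 RMSE."""
    return 10 ** rmse


def mass_to_molar(lc50_mg_per_l, mol_weight_g_per_mol):
    """Convert mg/L to mol/L (mg/L divided by g/mol is mmol/L, hence the 1000)."""
    return lc50_mg_per_l / mol_weight_g_per_mol / 1000.0


# Hypothetical toy values, chosen only to exercise the functions.
toy_true = [1.0, 10.0, 100.0]
toy_pred = [10.0, 10.0, 10.0]
print(rmse_log10(toy_true, toy_pred))

# The RMSE of 0.90 reported in the abstract implies a factor of about 7.9,
# i.e. close to one order of magnitude on the original LC50 scale.
print(fold_error(0.90))
```

In log10 space, converting from mass to molar concentration only subtracts a per-chemical constant (the log10 of the molecular weight, plus a unit factor) from the target value.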




Source journal

Environmental science. Advances

CiteScore

1.90

Self-citation rate

0.00%

Number of articles published