Critical insights into data curation and label noise for accurate prediction of aerobic biodegradability of organic chemicals†

IF 4.3 | Region 3, Environmental Science and Ecology | Q1 CHEMISTRY, ANALYTICAL

Environmental Science: Processes & Impacts Pub Date : 2024-09-26 DOI:10.1039/D4EM00431K

Paulina Körner, Juliane Glüge, Stefan Glüge and Martin Scheringer

Volume 10, pages 1780-1795. Article page: https://pubs.rsc.org/en/content/articlelanding/2024/em/d4em00431k


Abstract

This work aims to improve state-of-the-art machine learning (ML) models for predicting the aerobic biodegradability of organic chemicals through a data-centric approach. An existing dataset previously used to train ML models was analyzed for mismatching chemical identifiers and for data leakage between the test and training sets, and the detected errors were corrected. Chemicals with high variance between study results were removed, and an XGBoost model was trained on the curated dataset. Despite extensive data curation, the classification model's performance improved only marginally. Three potential reasons were considered: (1) a significant number of data labels were noisy, (2) the features could not sufficiently represent the chemicals, and/or (3) the model struggled to learn and generalize effectively. All three were examined, and reason (1) appeared to be the most decisive factor preventing more accurate results. Removing data points with possibly noisy labels, by label noise filtering with two other predictive models, increased the classification model's balanced accuracy from 80.9% to 94.2%. The new classifier therefore outperforms any previously developed classification model for ready biodegradation. An examination of key characteristics (molecular weight of the substances, proportion of halogens present, and distribution of degradation labels) and of the applicability domain indicates that no share, or at most a small share, of difficult-to-learn substances was removed by the label noise filtering, meaning that the final model is still very robust.
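
The curation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the identifier type or the exact filtering rule, so the sketch assumes a generic `chem_id` field and treats a label as noisy only when both auxiliary predictions contradict it. The function names and record layout are hypothetical.

```python
def shared_identifiers(train, test, key="chem_id"):
    """Return identifiers present in both splits (a simple leakage check)."""
    return sorted({r[key] for r in train} & {r[key] for r in test})


def filter_noisy_labels(records, pred_a, pred_b, key="chem_id"):
    """Split records into (kept, removed).

    Assumption: a record is removed only if BOTH auxiliary predictions
    (dicts mapping identifier -> predicted label) disagree with its label.
    """
    kept, removed = [], []
    for rec in records:
        votes = (pred_a[rec[key]], pred_b[rec[key]])
        if all(v != rec["label"] for v in votes):
            removed.append(rec)
        else:
            kept.append(rec)
    return kept, removed


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = degradable, 0 = not)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if idx:  # skip a class that does not occur in y_true
            recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy data, invented purely for illustration.
data = [
    {"chem_id": "c1", "label": 1},
    {"chem_id": "c2", "label": 0},
    {"chem_id": "c3", "label": 1},
    {"chem_id": "c4", "label": 0},
]
aux_a = {"c1": 1, "c2": 0, "c3": 0, "c4": 0}
aux_b = {"c1": 1, "c2": 1, "c3": 0, "c4": 0}

kept, removed = filter_noisy_labels(data, aux_a, aux_b)
print([r["chem_id"] for r in removed])  # only c3 is contradicted by both models
```

Balanced accuracy averages the recall of each class, which keeps a skewed distribution of degradation labels from inflating the score. A stricter or looser filter (for example, removing a record when either auxiliary model disagrees) would trade dataset size against label cleanliness differently.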

Source journal

Environmental Science: Processes & Impacts CHEMISTRY, ANALYTICAL-ENVIRONMENTAL SCIENCES

CiteScore

9.50

Self-citation rate

3.60%

Articles published

202

Review time

1 month

About the journal: Environmental Science: Processes & Impacts publishes high-quality papers in all areas of the environmental chemical sciences, including chemistry of the air, water, soil and sediment. We welcome studies on the environmental fate and effects of anthropogenic and naturally occurring contaminants, both chemical and microbiological, as well as related natural element cycling processes.