Critical insights into data curation and label noise for accurate prediction of aerobic biodegradability of organic chemicals†

IF 4.3 | Zone 3, Environmental Science and Ecology | Q1, CHEMISTRY, ANALYTICAL
Paulina Körner, Juliane Glüge, Stefan Glüge and Martin Scheringer
Journal: Environmental Science: Processes & Impacts, issue 10, pages 1780-1795
Published: 2024-09-26
DOI: 10.1039/D4EM00431K
Article page: https://pubs.rsc.org/en/content/articlelanding/2024/em/d4em00431k
PDF: https://pubs.rsc.org/en/content/articlepdf/2024/em/d4em00431k?page=search
Citations: 0

Abstract

This work aims to improve state-of-the-art machine learning (ML) models for predicting the aerobic biodegradability of organic chemicals through a data-centric approach. To that end, an existing dataset previously used to train ML models was analyzed for mismatching chemical identifiers and for data leakage between the test and training sets, and the detected errors were corrected. Chemicals with high variance between study results were removed, and an XGBoost model was trained on the curated dataset. Despite extensive data curation, the classification model's performance improved only marginally. Three potential reasons were considered: (1) a significant number of data labels were noisy, (2) the features could not sufficiently represent the chemicals, and/or (3) the model struggled to learn and generalize effectively. All three were examined, and reason (1) appeared to be the decisive factor preventing the model from producing more accurate results. Removing data points with possibly noisy labels, identified through label noise filtering with two other predictive models, increased the classification model's balanced accuracy from 80.9% to 94.2%. The new classifier therefore outperforms all previously developed classification models for ready biodegradation. An examination of key characteristics (molecular weight of the substances, proportion of halogens present, and distribution of degradation labels) and of the applicability domain indicates that no share, or at least no large share, of difficult-to-learn substances was removed by the label noise filtering, meaning that the final model remains very robust.
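
The curation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact filtering rule, so the consensus rule below (drop a data point only when both auxiliary models contradict its label) is an assumption, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall, which is less sensitive to class imbalance."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def leaked_identifiers(train_ids, test_ids):
    """Identifiers present in both splits, i.e. candidates for data leakage."""
    return sorted(set(train_ids) & set(test_ids))


def filter_label_noise(labels, preds_a, preds_b):
    """Return the indices to keep under an assumed consensus rule:
    a point is dropped only when BOTH auxiliary models contradict its label."""
    keep = []
    for i, (label, a, b) in enumerate(zip(labels, preds_a, preds_b)):
        if a != label and b != label:
            continue  # both models disagree: treat the label as possibly noisy
        keep.append(i)
    return keep


# Toy data invented for illustration (1 = readily biodegradable, 0 = not).
labels = [1, 1, 0, 0, 1, 0]
preds_a = [1, 0, 0, 1, 0, 0]  # predictions of a hypothetical auxiliary model A
preds_b = [1, 1, 0, 1, 0, 0]  # predictions of a hypothetical auxiliary model B

kept = filter_label_noise(labels, preds_a, preds_b)
print(kept)  # [0, 1, 2, 5]
```

A stricter rule (drop a point when either model disagrees) would remove more data and risks discarding substances that are merely hard to learn, which is the concern the abstract addresses by checking key characteristics and the applicability domain after filtering.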

Source journal
Environmental Science: Processes & Impacts (CHEMISTRY, ANALYTICAL; ENVIRONMENTAL SCIENCES)
CiteScore: 9.50
Self-citation rate: 3.60%
Articles published: 202
Review time: 1 month
About the journal: Environmental Science: Processes & Impacts publishes high-quality papers in all areas of the environmental chemical sciences, including the chemistry of air, water, soil, and sediment. The journal welcomes studies on the environmental fate and effects of anthropogenic and naturally occurring contaminants, both chemical and microbiological, as well as related natural element cycling processes.