Hybrid Class Balancing Approach for Chemical Compound Toxicity Prediction.

Current Computer-Aided Drug Design, Pub Date: 2024-09-24, DOI: 10.2174/0115734099315538240909101737

Felipe Santiago-Gonzalez, Jose L Martinez-Rodriguez, Carlos García-Perez, Alfredo Juárez-Saldivar, Hugo E Camacho-Cruz

Abstract

Introduction: Computational methods are crucial for efficient and cost-effective drug toxicity prediction. Unfortunately, the data used for prediction is often imbalanced, resulting in biased models that favor the majority class. This paper proposes an approach to apply a hybrid class balancing technique and evaluate its performance on computational models for toxicity prediction in Tox21 datasets.

Methods: The process begins by converting chemical compound data structures (SMILES strings) from various bioassay datasets into molecular descriptors that can be processed by algorithms. Subsequently, Undersampling and Oversampling techniques are applied to the training data in two different schemes. In the first scheme (Individual), only one balancing technique (Oversampling or Undersampling) is used. In the second scheme (Hybrid), the training data is divided according to a ratio (e.g., 90-10), and a different balancing technique is applied to each portion. We considered eight resampling techniques (four Oversampling and four Undersampling), six molecular descriptors (based on MACCS, ECFP, and Mordred), and five classification models (KNN, MLP, RF, XGB and SVM) over 10 bioassay datasets to determine the configurations that yield the best performance.
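
The Hybrid scheme can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the split is stratified by class, labels are binary, each portion is balanced to equal class counts on its own, and the oversampler is a simplified SMOTE-like interpolation with a single nearest neighbour. All function names (`split_by_ratio`, `undersample`, `oversample`, `hybrid_balance`) and the toy data are hypothetical.

```python
import random

def split_by_ratio(X, y, ratio, rng):
    """Stratified split (assumption): `ratio` of each class goes to part A, the rest to part B."""
    part_a, part_b = [], []
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        rng.shuffle(rows)
        cut = round(len(rows) * ratio)
        part_a += [(x, label) for x in rows[:cut]]
        part_b += [(x, label) for x in rows[cut:]]
    return part_a, part_b

def undersample(pairs, rng):
    """Random undersampling (RUS): keep a random subset so each class matches the smallest class."""
    groups = {}
    for x, label in pairs:
        groups.setdefault(label, []).append(x)
    n = min(len(rows) for rows in groups.values())
    return [(x, label) for label, rows in groups.items() for x in rng.sample(rows, n)]

def oversample(pairs, rng):
    """Simplified SMOTE-like step: add interpolated rows so each class matches the largest class."""
    groups = {}
    for x, label in pairs:
        groups.setdefault(label, []).append(x)
    n = max(len(rows) for rows in groups.values())
    out = list(pairs)
    for label, rows in groups.items():
        for _ in range(n - len(rows)):
            a = rng.choice(rows)
            # Nearest same-class neighbour (k=1); falls back to `a` if it is the only row.
            others = [r for r in rows if r is not a] or [a]
            b = min(others, key=lambda r: sum((p - q) ** 2 for p, q in zip(a, r)))
            t = rng.random()
            # Synthetic row lies on the segment between `a` and its neighbour `b`.
            out.append(([p + t * (q - p) for p, q in zip(a, b)], label))
    return out

def hybrid_balance(X, y, ratio=0.9, seed=0):
    """Undersample the `ratio` portion, oversample the remainder, then merge both."""
    rng = random.Random(seed)
    part_a, part_b = split_by_ratio(X, y, ratio, rng)
    balanced = undersample(part_a, rng) + oversample(part_b, rng)
    return [x for x, _ in balanced], [label for _, label in balanced]

# Toy data (hypothetical): 100 majority-class rows (label 0), 20 minority-class rows (label 1).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(20)]
y = [0] * 100 + [1] * 20

Xb, yb = hybrid_balance(X, y, ratio=0.9)
print(len(yb), yb.count(0), yb.count(1))
```

In this sketch only the training data is resampled, as the abstract states; a held-out test set would keep its original class distribution. In practice a library implementation of SMOTE and RUS would replace the hand-written helpers.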

Results: We defined three testing scenarios: without balancing techniques (baseline), Individual, and Hybrid. We found that using the ENN technique in the MACCS-MLP combination resulted in a 10.01% improvement in performance. The increase for ECFP6-2048 was 16.47% after incorporating a combination of the SMOTE (10%) and RUS (90%) techniques. Meanwhile, using the same combination of techniques, MORDRED-XGB showed the most significant increase in performance, achieving a 22.62% improvement.

Conclusion: Integrating any of the class balancing schemes resulted in a minimum of 10.01% improvement in prediction performance compared to the best baseline configuration. In this study, Undersampling techniques were more appropriate due to the significant overlap among samples. By eliminating specific samples from the predominant class that are close to the minority class, this overlap is greatly reduced.
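
The mechanism described in the conclusion, removing majority-class samples that lie close to the minority class, is the idea behind Edited Nearest Neighbours (ENN). The sketch below shows it under stated assumptions and is not the configuration used in the study: it applies a majority-vote rule with k = 3 neighbours, uses squared Euclidean distance, and edits only the majority class. The function name `edited_nearest_neighbours` and the toy points are hypothetical.

```python
def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Drop majority-class rows whose k nearest neighbours mostly carry another label."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == majority_label:
            # Squared Euclidean distance to every other row, nearest first.
            dists = sorted(
                (sum((p - q) ** 2 for p, q in zip(xi, xj)), j)
                for j, xj in enumerate(X)
                if j != i
            )
            votes = [y[j] for _, j in dists[:k]]
            # Remove the row when fewer than half of its neighbours share its label.
            if votes.count(majority_label) * 2 < len(votes):
                continue
        keep_X.append(xi)
        keep_y.append(yi)
    return keep_X, keep_y

# Toy data (hypothetical): one majority-class point (5.0) sits inside the minority cluster.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [5.0], [4.9], [5.1], [5.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

Xe, ye = edited_nearest_neighbours(X, y, majority_label=0)
print(Xe, ye)
```

The overlapping majority point is removed while all minority points are kept, so the boundary between the two classes becomes cleaner without any synthetic samples being created.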
