利用随机森林、支持向量机、AutoGluon-Tabular和H2O AutoML解决药物发现和开发中的不平衡分类问题。

IF 5.6 2区化学 Q1 CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling Pub Date : 2025-04-15 DOI:10.1021/acs.jcim.5c00023

Ayush Garg,Narayanan Ramamurthi,Shyam Sundar Das

{"title":"利用随机森林、支持向量机、AutoGluon-Tabular和H2O AutoML解决药物发现和开发中的不平衡分类问题。","authors":"Ayush Garg,Narayanan Ramamurthi,Shyam Sundar Das","doi":"10.1021/acs.jcim.5c00023","DOIUrl":null,"url":null,"abstract":"The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques─(a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek─and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.","PeriodicalId":44,"journal":{"name":"Journal of Chemical Information and Modeling ","volume":"50 1","pages":""},"PeriodicalIF":5.6000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML.\",\"authors\":\"Ayush Garg,Narayanan Ramamurthi,Shyam Sundar Das\",\"doi\":\"10.1021/acs.jcim.5c00023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques─(a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek─and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.\",\"PeriodicalId\":44,\"journal\":{\"name\":\"Journal of Chemical Information and Modeling \",\"volume\":\"50 1\",\"pages\":\"\"},\"PeriodicalIF\":5.6000,\"publicationDate\":\"2025-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Chemical Information and Modeling \",\"FirstCategoryId\":\"92\",\"ListUrlMain\":\"https://doi.org/10.1021/acs.jcim.5c00023\",\"RegionNum\":2,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MEDICINAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Chemical Information and Modeling ","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1021/acs.jcim.5c00023","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MEDICINAL","Score":null,"Total":0}

引用次数: 0

摘要

在类不平衡数据集上建立的分类模型往往优先考虑多数类的准确率，因此，少数类通常有较高的误分类率。有不同的技术可用于解决分类模型中的类不平衡问题，可分为数据级、算法级和混合方法。但据我们所知，在文献中没有对这些技术对班级比例的表现进行深入分析。我们在本研究中解决了这些缺点，并使用机器学习（ML）方法和AutoML工具对四种不同技术的性能进行了详细分析，以解决不平衡的类分布。为了开展我们的研究，我们选择了四种这样的技术──(a)使用(i) GHOST和（ii） precision-recall curve （AUPR）曲线下面积的阈值优化，(b) AutoML和机器学习方法的类权的内部平衡方法，以及(c)使用SMOTETomek的数据平衡──并从属于药物发现和开发领域的三个数据集中生成了27个考虑9个不同类比（即正类和总样本的比例）的数据集。我们使用随机森林（random forest， RF）和支持向量机（support vector machine， SVM）作为ML分类器的代表，使用AutoGluon-Tabular（版本0.6.1）和H2O AutoML（版本3.40.0.4）作为AutoML工具的代表。本研究的重要发现如下：(1)阈值优化对AUC和AUPR等排名指标没有影响，但AUC和AUPR受到类别加权和SMOTTomek的影响；（ii）机器学习方法RF和SVM在所有数据集上的F1分数、MCC和平衡精度分别提高了375、33.33和450个显著百分比，适用于不平衡数据集的性能评估；（iii）对于AutoML库AutoGluon-Tabular和H2O AutoML，在所有数据集上，F1分数、MCC和平衡精度分别可以实现383.33、37.25和533.33的显著提高；（iv）平衡准确度百分比改善的一般模式是，当类别比率系统地由0.5降至0.1时，百分比改善会增加；在F1分数和MCC的情况下，班级比例为0.3时改善最大；(v)对于具有平衡的ML和AutoML，可以观察到，任何单独的类平衡技术都不会在基于F1分数的数据集数量显著增加的情况下优于所有其他方法；（vi）三种外部平衡技术的组合优于ML和AutoML的内部平衡方法；（vii） AutoML工具的表现与ML模型一样好，在某些情况下，当应用不平衡处理技术时，处理不平衡分类的表现甚至更好。总之，建议探索多种数据平衡技术来对不平衡数据集进行分类，以获得最佳性能，因为外部技术和内部技术都没有明显优于其他技术。结果是特定于本研究中使用的ML方法和AutoML库的，为了推广，可以考虑相当数量的ML方法和AutoML库进行研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML.

The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques─(a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR) curve, (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek─and generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifier and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTTomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as good as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance as neither of the external techniques nor the internal techniques outperform others significantly. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Chemical Information and Modeling 化学-化学综合

CiteScore

9.80

自引率

10.70%

发文量

529

审稿时长

1.4 months

期刊介绍： The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.