Improving Machine Learning Classification Predictions through SHAP and Features Analysis Interpretation
Authors: Leonardo Bernal, Giulio Rastelli, Luca Pinzi
Journal: Journal of Chemical Information and Modeling (impact factor 5.3; JCR Q1, Chemistry, Medicinal)
Publication type: Journal Article
Publication date: 2025-10-20
DOI: 10.1021/acs.jcim.5c02015 (https://doi.org/10.1021/acs.jcim.5c02015)
Abstract
Tree-based machine learning (ML) algorithms, such as Extra Trees (ET), Random Forest (RF), Gradient Boosting Machine (GBM), and XGBoost (XGB), are among the most widely used in early drug discovery, given their versatility and performance. However, models based on these algorithms often suffer from misclassification and reduced interpretability, which limit their applicability in practice. To address these challenges, several approaches have been proposed, including the use of SHapley Additive Explanations (SHAP). While SHAP values are commonly used to elucidate the importance of the features driving a model's predictions, they can also be employed in strategies to improve prediction performance. Building on these premises, we propose a novel approach that integrates SHAP and feature value analyses to reduce misclassification in model predictions. Specifically, we benchmarked classifiers based on the ET, RF, GBM, and XGB algorithms using data sets of compounds with known antiproliferative activity against three prostate cancer (PC) cell lines (i.e., PC3, LNCaP, and DU-145). The best-performing models, based on RDKit and ECFP4 descriptors with the GBM and XGB algorithms, achieved MCC values above 0.58 and F1-scores above 0.8 across all data sets, demonstrating satisfactory accuracy and precision. Analyses of SHAP values revealed that many misclassified compounds possess feature values that fall within the range typically associated with the opposite class. Based on these findings, we developed a misclassification-detection framework using four filtering rules, which we termed "RAW", "SHAP", "RAW OR SHAP", and "RAW AND SHAP". These filtering rules successfully identified several potentially misclassified predictions, with the "RAW OR SHAP" rule retrieving up to 21%, 23%, and 63% of misclassified compounds in the PC3, DU-145, and LNCaP test sets, respectively. The developed flagging rules enable the systematic exclusion of likely misclassified compounds, even across progressively higher prediction confidence levels, thus providing a valuable approach to improve classifier performance in virtual screening applications.
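The abstract does not spell out how the four filtering rules are computed, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that the "range typically associated" with a class is the interquartile range of each feature among reference compounds of that class, and that a prediction is flagged when at least `min_hits` features lie inside the opposite class's range and outside the predicted class's range. The function names, the percentile thresholds, `min_hits`, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import numpy as np


def class_ranges(values, labels, low=25, high=75):
    """Per-class [low, high] percentile range of each column.

    Assumption: this percentile window stands in for the 'typical range'
    of a class; the paper may define it differently.
    """
    return {c: np.percentile(values[labels == c], [low, high], axis=0)
            for c in (0, 1)}


def opposite_range_flags(values, predicted, ranges, min_hits=1):
    """Flag rows with >= min_hits columns that sit inside the range of the
    class that was NOT predicted and outside the predicted class's range."""
    flags = np.zeros(len(values), dtype=bool)
    for i, (row, pred) in enumerate(zip(values, predicted)):
        own_lo, own_hi = ranges[int(pred)]
        opp_lo, opp_hi = ranges[1 - int(pred)]
        in_opposite = (row >= opp_lo) & (row <= opp_hi)
        in_own = (row >= own_lo) & (row <= own_hi)
        flags[i] = np.sum(in_opposite & ~in_own) >= min_hits
    return flags


def flagging_rules(raw_flags, shap_flags):
    """Combine the two per-compound flags into the four named rules."""
    return {
        "RAW": raw_flags,
        "SHAP": shap_flags,
        "RAW OR SHAP": raw_flags | shap_flags,
        "RAW AND SHAP": raw_flags & shap_flags,
    }


# Toy reference data (hypothetical): two features, five compounds per class.
ref_labels = np.array([0] * 5 + [1] * 5)
ref_raw = np.array([[v, v] for v in [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]],
                   dtype=float)
ref_shap = np.array([[v, v] for v in [-0.5, -0.4, -0.3, -0.2, -0.1,
                                      0.1, 0.2, 0.3, 0.4, 0.5]])

# Toy test set (hypothetical): raw feature values, SHAP values, predictions.
test_raw = np.array([[3, 3], [13, 13], [13, 13], [3, 13]], dtype=float)
test_shap = np.array([[-0.3, -0.3], [0.3, -0.3], [0.3, 0.3], [0.3, 0.3]])
predicted = np.array([0, 0, 1, 1])
true = np.array([0, 1, 1, 0])

raw_flags = opposite_range_flags(test_raw, predicted,
                                 class_ranges(ref_raw, ref_labels))
shap_flags = opposite_range_flags(test_shap, predicted,
                                  class_ranges(ref_shap, ref_labels))
rules = flagging_rules(raw_flags, shap_flags)

# Fraction of truly misclassified compounds that each rule retrieves.
misclassified = predicted != true
retrieved = {name: float((flags & misclassified).sum() / misclassified.sum())
             for name, flags in rules.items()}
```

In this toy setup the OR rule is the most permissive and the AND rule the most conservative, which mirrors the trade-off one would expect between retrieving more misclassified compounds and discarding fewer correct predictions. In practice the SHAP matrix would come from an explainer applied to the trained model rather than from hand-written numbers.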
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.