Classification models and SAR analysis of anaplastic lymphoma kinase (ALK) inhibitors using machine learning algorithms with two data division methods.

IF 3.9 · Zone 2 (Chemistry) · JCR Q2 (CHEMISTRY, APPLIED)
Dan Qu, Aixia Yan
Journal: Molecular Diversity
Publication type: Journal Article
Publication date: 2024-11-12
DOI: 10.1007/s11030-024-10990-x (https://doi.org/10.1007/s11030-024-10990-x)
Citations: 0

Abstract

Anaplastic lymphoma kinase (ALK) plays a critical role in the development of various cancers. In this study, a dataset of 1810 collected inhibitors was divided into a training set and a test set in two ways: with a self-organizing map (SOM) and by random splitting. We developed 32 classification models using Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), and Extreme Gradient Boosting (XGBoost) to distinguish highly active from weakly active ALK inhibitors, with the inhibitors represented by MACCS and ECFP4 fingerprints. Model 7D, built with the RF algorithm on training set 1/test set 1 (divided by the SOM method), performed best, with a prediction accuracy of 90.97% and a Matthews correlation coefficient (MCC) of 0.79 on the test set. We clustered the 1810 inhibitors into 10 subsets with the K-Means algorithm to identify the structural characteristics of highly active ALK inhibitors. The main scaffolds of highly active ALK inhibitors were also analyzed based on ECFP4 fingerprints. Several substructures were found to have a significant effect on high activity, such as 2,4-diarylaminopyrimidine analogues, pyrrolo[2,1-f][1,2,4]triazine, indolo[2,3-b]quinoline-11-one, benzo[d]imidazole and pyrrolo[2,3-b]pyridine. In addition, the subsets were summarized into several clusters, four of which showed a significant relationship with ALK inhibitory activity. Finally, Shapley additive explanations (SHAP) were used to explain the influence of the modeling features on the model predictions. The SHAP results indicated that our models reflect the structural features of ALK inhibitors well.
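
The abstract reports two test-set metrics, prediction accuracy and the Matthews correlation coefficient (MCC). As a minimal sketch of how both follow from a binary confusion matrix, the Python below computes them from true and predicted labels. The function names and the toy labels are assumptions made for illustration only; they are not the authors' code or data, and the label encoding (1 = highly active, 0 = weakly active) is likewise an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels.

    Assumed encoding: 1 = highly active, 0 = weakly active.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of compounds classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal total is zero (the usual convention
    for an undefined denominator).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, purely illustrative (not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"accuracy = {accuracy(tp, tn, fp, fn):.2f}")  # accuracy = 0.75
print(f"MCC      = {mcc(tp, tn, fp, fn):.2f}")       # MCC      = 0.50
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the highly active and weakly active classes are of unequal size. That is one common reason for reporting both values.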

Source journal
Molecular Diversity
Category: Chemistry (Chemistry, Multidisciplinary)
CiteScore: 7.30
Self-citation rate: 7.90%
Articles published: 219
Review time: 2.7 months
Journal description: Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, application of diversity for screening against alternative targets of all types (biological, biophysical, technological), analysis of results obtained and their application in various scientific disciplines/approaches including: combinatorial chemistry and parallel synthesis; small molecule libraries; microwave synthesis; flow synthesis; fluorous synthesis; diversity oriented synthesis (DOS); nanoreactors; click chemistry; multiplex technologies; fragment- and ligand-based design; structure/function/SAR; computational chemistry and molecular design; chemoinformatics; screening techniques and screening interfaces; analytical and purification methods; robotics, automation and miniaturization; targeted libraries; display libraries; peptides and peptoids; proteins; oligonucleotides; carbohydrates; natural diversity; new methods of library formulation and deconvolution; directed evolution, origin of life and recombination; search techniques, landscapes, random chemistry and more.