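Classification models and SAR analysis of anaplastic lymphoma kinase (ALK) inhibitors using machine learning algorithms with two data division methods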

IF 3.8, Zone 2 (Chemistry), JCR Q2: CHEMISTRY, APPLIED

Molecular Diversity. Pub Date: 2025-08-01. Epub Date: 2024-11-12. DOI: 10.1007/s11030-024-10990-x

Dan Qu, Aixia Yan

Pages: 2919-2943

Publication type: Journal Article

Citations: 0


Classification models and SAR analysis of anaplastic lymphoma kinase (ALK) inhibitors using machine learning algorithms with two data division methods.

Anaplastic lymphoma kinase (ALK) plays a critical role in the development of various cancers. In this study, a dataset of 1810 collected inhibitors was divided into a training set and a test set by two methods: the self-organizing map (SOM) and random splitting. We developed 32 classification models using Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), and Extreme Gradient Boosting (XGBoost) to distinguish between highly and weakly active ALK inhibitors, with the inhibitors represented by MACCS and ECFP4 fingerprints. Model 7D, which was built with the RF algorithm on training set 1/test set 1 (split by the SOM method), performed best, with a prediction accuracy of 90.97% and a Matthews correlation coefficient (MCC) of 0.79 on the test set. We clustered the 1810 inhibitors into 10 subsets with the K-Means algorithm to identify the structural characteristics of highly active ALK inhibitors. The main scaffolds of highly active ALK inhibitors were also analyzed based on ECFP4 fingerprints. Several substructures were found to have a significant effect on high activity, such as 2,4-diarylaminopyrimidine analogues, pyrrolo[2,1-f][1,2,4]triazine, indolo[2,3-b]quinoline-11-one, benzo[d]imidazole and pyrrolo[2,3-b]pyridine. In addition, the subsets were summarized into several clusters, four of which showed a significant relationship with ALK inhibitory activity. Finally, Shapley additive explanations (SHAP) were used to explain the influence of the modeling features on the model predictions. The SHAP results indicated that our models reflect the structural features of ALK inhibitors well.
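As a reading aid for the two metrics quoted in the abstract, the sketch below computes prediction accuracy and the Matthews correlation coefficient (MCC) from binary confusion-matrix counts. It is not the authors' code: the function names and the toy labels are illustrative assumptions, and the numbers it prints have no connection to the reported 90.97% and 0.79.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels.

    Assumed encoding: 1 = highly active inhibitor, 0 = weakly active.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented toy labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it stays informative when the highly active and weakly active classes are imbalanced; this is a general property of the metric, not a claim about the class balance in this dataset.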


Source journal

Molecular Diversity (Chemistry: Multidisciplinary Chemistry)

CiteScore: 7.30

Self-citation rate: 7.90%

Articles published: 219

Review time: 2.7 months

About the journal: Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, application of diversity for screening against alternative targets of all types (biological, biophysical, technological), analysis of results obtained and their application in various scientific disciplines/approaches including: combinatorial chemistry and parallel synthesis; small molecule libraries; microwave synthesis; flow synthesis; fluorous synthesis; diversity oriented synthesis (DOS); nanoreactors; click chemistry; multiplex technologies; fragment- and ligand-based design; structure/function/SAR; computational chemistry and molecular design; chemoinformatics; screening techniques and screening interfaces; analytical and purification methods; robotics, automation and miniaturization; targeted libraries; display libraries; peptides and peptoids; proteins; oligonucleotides; carbohydrates; natural diversity; new methods of library formulation and deconvolution; directed evolution, origin of life and recombination; search techniques, landscapes, random chemistry and more;