Authors: Dan Qu, Aixia Yan
Journal: Molecular Diversity (published 2024-11-12)
DOI: 10.1007/s11030-024-10990-x (https://doi.org/10.1007/s11030-024-10990-x)
Classification models and SAR analysis of anaplastic lymphoma kinase (ALK) inhibitors using machine learning algorithms with two data division methods.
Anaplastic lymphoma kinase (ALK) plays a critical role in the development of various cancers. In this study, a dataset of 1810 collected inhibitors was divided into a training set and a test set in two ways: by the self-organizing map (SOM) method and by a random method. We developed 32 classification models using Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), and Extreme Gradient Boosting (XGBoost) to distinguish between highly and weakly active ALK inhibitors, with the inhibitors represented by MACCS and ECFP4 fingerprints. Model 7D, built with the RF algorithm on training set 1/test set 1 from the SOM division, performed best, with a prediction accuracy of 90.97% and a Matthews correlation coefficient (MCC) of 0.79 on the test set. We clustered the 1810 inhibitors into 10 subsets with the K-Means algorithm to identify the structural characteristics of highly active ALK inhibitors. The main scaffolds of highly active ALK inhibitors were also analyzed based on ECFP4 fingerprints. Some substructures were found to have a significant effect on high activity, such as 2,4-diarylaminopyrimidine analogues, pyrrolo[2,1-f][1,2,4]triazine, indolo[2,3-b]quinoline-11-one, benzo[d]imidazole and pyrrolo[2,3-b]pyridine. In addition, the subsets were summarized into several clusters, four of which showed a significant relationship with ALK inhibitory activity. Finally, Shapley additive explanations (SHAP) were used to explain the influence of the modeling features on the model predictions. The SHAP results indicated that the models reflect the structural features of ALK inhibitors well.
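The abstract reports model quality as prediction accuracy and Matthews correlation coefficient (MCC) on a test set. As a minimal sketch of how these two metrics are computed for a binary highly-active/weakly-active classifier, the Python below uses only the standard library. The function names and the toy labels are assumptions made for illustration; they are not the authors' code or data, and returning 0.0 when the MCC denominator is zero is a convention chosen here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = highly active, 0 = weakly active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of compounds whose class was predicted correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if the denominator is zero (assumed convention)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels invented for illustration only (not from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"MCC      = {mcc(y_true, y_pred):.2f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the highly and weakly active classes are unevenly sized, which is one reason the two metrics are often reported together.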
About the journal:
Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, application of diversity for screening against alternative targets of all types (biological, biophysical, technological), analysis of results obtained and their application in various scientific disciplines/approaches including:
combinatorial chemistry and parallel synthesis;
small molecule libraries;
microwave synthesis;
flow synthesis;
fluorous synthesis;
diversity oriented synthesis (DOS);
nanoreactors;
click chemistry;
multiplex technologies;
fragment- and ligand-based design;
structure/function/SAR;
computational chemistry and molecular design;
chemoinformatics;
screening techniques and screening interfaces;
analytical and purification methods;
robotics, automation and miniaturization;
targeted libraries;
display libraries;
peptides and peptoids;
proteins;
oligonucleotides;
carbohydrates;
natural diversity;
new methods of library formulation and deconvolution;
directed evolution, origin of life and recombination;
search techniques, landscapes, random chemistry and more.