Targeting neurodegeneration: three machine learning methods for G9a inhibitors discovery using PubChem and scikit-learn

IF 3.1 | Zone 3, Biology | Q3, BIOCHEMISTRY & MOLECULAR BIOLOGY

Journal of Computer-Aided Molecular Design | Pub Date: 2025-08-06 | DOI: 10.1007/s10822-025-00642-z

Mariya L. Ivanova, Nicola Russo, Konstantin Nikolic


Citations: 0

Abstract

In light of the increasing interest in G9a’s role in neuroscience, three time-efficient and cost-effective machine learning (ML) models were developed to support researchers in this area. The models are based on data provided by PubChem and use algorithms implemented in scikit-learn, a Python-based ML library. The first ML model aimed to predict the efficacy magnitude of active G9a inhibitors. The ML models for this task were trained with 3112 samples and tested with 778. The Gradient Boosting Regressor performed best, achieving 17.81% mean relative error, 21.48% mean absolute error, 27.39% root mean squared error and a coefficient of determination (R²) error of 0.02. The second ML model, called the CID_SID ML model, used PubChem identifiers to predict G9a inhibition by a small biomolecule that was primarily designed for a different purpose. The ML models for this task were trained with 58,552 samples and tested with 14,000. The most suitable classifier for this case study was the Extreme Gradient Boosting Classifier, which obtained 79.7% accuracy, 83.2% precision, 67.7% recall, 74.7% F1-score and 78.4% ROC. To date, this methodology has been used in seven studies, achieving a mean accuracy of 82.75%, precision of 90.71%, recall of 73.01%, F1-score of 80.79% and ROC of 80.63% across all case studies. The third ML model used IUPAC names. It was based on the Random Forest Classifier algorithm, trained with 19,455 samples and tested with 14,100, and achieved 68.2% accuracy. Its feature importance list was then reordered by the relative proportion of active cases in which each fragment participated. Thus, “iodide” was identified as the fragment with the highest proportion of active cases among all cases in which it participated. In addition, “iodo” was identified as the most desirable fragment and “phenylcarbamate” as the least desirable, based on their participation only in active or only in inactive cases, respectively. The computational approach was initially developed and demonstrated in a case study on Tyrosyl-DNA phosphodiesterase 1 (TDP1) inhibition.
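The reordering of name fragments by their share of active cases can be illustrated with a minimal sketch. It is not the authors' code: the toy names and labels, the regex tokenisation and the helper names `fragments` and `rank_fragments` are all assumptions made for illustration, and the Random Forest step that produces the initial feature list is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical toy data: (IUPAC-style name, label), 1 = active, 0 = inactive.
# These rows are invented for illustration and are not taken from the study.
SAMPLES = [
    ("1-iodo-4-methylbenzene", 1),
    ("2-iodo-1-methylbenzene", 1),
    ("methyl phenylcarbamate", 0),
    ("ethyl phenylcarbamate", 0),
    ("4-methylphenol", 1),
    ("2-methylphenol", 0),
]


def fragments(name):
    """Split a name into lowercase alphabetic fragments (an assumed tokenisation)."""
    return set(re.findall(r"[a-z]+", name.lower()))


def rank_fragments(samples):
    """Return (fragment, active_proportion, n_cases) sorted by active proportion."""
    total = Counter()   # number of cases in which each fragment appears
    active = Counter()  # number of those cases labelled active
    for name, label in samples:
        for frag in fragments(name):
            total[frag] += 1
            active[frag] += label
    rows = [(frag, active[frag] / total[frag], total[frag]) for frag in total]
    # Highest active proportion first; ties broken by case count, then by name.
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))


if __name__ == "__main__":
    ranked = rank_fragments(SAMPLES)
    for frag, proportion, n_cases in ranked:
        print(f"{frag:16s} active proportion = {proportion:.2f} (n = {n_cases})")
    # Fragments seen only in active cases, or only in inactive cases.
    print("only active:  ", [f for f, p, _ in ranked if p == 1.0])
    print("only inactive:", [f for f, p, _ in ranked if p == 0.0])
```

On real data, a fragment that appears in only a handful of cases can reach a proportion of 1.0 or 0.0 by chance, which is why the sketch reports the case count alongside each proportion.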

Source journal

Journal of Computer-Aided Molecular Design (Biology / Computer Science: Interdisciplinary Applications)

CiteScore

8.00

Self-citation rate

8.60%

Articles published

Review time

3 months

About the journal: The Journal of Computer-Aided Molecular Design provides a forum for disseminating information on both the theory and the application of computer-based methods in the analysis and design of molecules. The scope of the journal encompasses papers which report new and original research and applications in the following areas: theoretical chemistry; computational chemistry; computer and molecular graphics; molecular modeling; protein engineering; drug design; expert systems; general structure-property relationships; molecular dynamics; chemical database development and usage.