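DrugProtAI: A machine learning-driven approach for predicting protein druggability through feature engineering and robust partition-based ensemble methods.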

IF 7.7 | Zone 2, Biology | JCR Q1, BIOCHEMICAL RESEARCH METHODS

Briefings in Bioinformatics | Pub Date: 2025-07-02 | DOI: 10.1093/bib/bbaf330

Ankit Halder, Sabyasachi Samantaray, Sahil Barbade, Aditya Gupta, Sanjeeva Srivastava

Volume 26, Issue 4. PMC full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12236430/pdf/

Citations: 0

Abstract

Drug design and development are central to clinical research, yet 90% of drugs fail to reach the clinic, often due to inappropriate selection of drug targets. Conventional methods for target identification lack precision and sensitivity. While various computational tools have been developed to predict the druggability of proteins, they often focus on limited subsets of the human proteome or rely solely on amino acid properties. Our study presents DrugProtAI, a tool developed by implementing a partitioning-based method and trained on the entire human protein set using both sequence- and non-sequence-derived properties. The partitioned method was evaluated using popular machine learning algorithms, of which Random Forest and XGBoost performed the best. A comprehensive analysis of 183 features, encompassing biophysical, sequence-, and non-sequence-derived properties, achieved a median Area Under Precision-Recall Curve (AUC) of 0.87 in target prediction. The model was further tested on a blinded validation set comprising recently approved drug targets. The key predictors were also identified, which we believe will help users in selecting appropriate drug targets. We believe that these insights are poised to significantly advance drug development. This version of the tool provides the probability of druggability for human proteins. The tool is freely accessible at https://drugprotai.pythonanywhere.com/.
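The abstract names a partition-based ensemble and a median area under the precision-recall curve, but does not say how the partitions are built or combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the majority (non-druggable) class is split into disjoint chunks, each chunk is paired with all positives to train one model, and the per-model probabilities are averaged. The centroid scorer is a toy stand-in for Random Forest or XGBoost, the data are synthetic, and every function name here is hypothetical.

```python
import math
import random
from statistics import median


def partition_negatives(negatives, n_partitions, seed=0):
    """Shuffle the majority class and split it into disjoint, near-equal partitions."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_partitions] for i in range(n_partitions)]


def fit_centroid_model(positives, negatives):
    """Toy stand-in for a real classifier (hypothetical, not from the paper)."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(positives), centroid(negatives)

    def predict_proba(x):
        # Closer to the positive centroid than the negative one -> score above 0.5.
        return 1.0 / (1.0 + math.exp(math.dist(x, c_pos) - math.dist(x, c_neg)))

    return predict_proba


def fit_partition_ensemble(positives, negatives, n_partitions, fit=fit_centroid_model):
    """Train one model per negative partition (each sees all positives); average their scores."""
    models = [fit(positives, part) for part in partition_negatives(negatives, n_partitions)]

    def ensemble_proba(x):
        return sum(model(x) for model in models) / len(models)

    return models, ensemble_proba


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve; ties keep input order."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / total_pos if total_pos else 0.0


# Synthetic, imbalanced toy data: 20 positives vs 200 negatives in 2-D.
rng = random.Random(1)
positives = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]
negatives = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
models, ensemble_proba = fit_partition_ensemble(positives, negatives, n_partitions=10)

# Score each per-partition model on a fresh synthetic test set, then take the median.
test_x = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)] + \
         [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
test_y = [1] * 10 + [0] * 100
per_model_ap = [average_precision(test_y, [m(x) for x in test_x]) for m in models]
print("median AP across partitions:", round(median(per_model_ap), 3))
print("ensemble AP:", round(average_precision(test_y, [ensemble_proba(x) for x in test_x]), 3))
```

Pairing every partition with the full positive set keeps each model's training data roughly balanced without discarding any negatives, which is one plausible reading of why a partitioned design suits a proteome where approved targets are a small minority. Precision-recall area is used rather than ROC area because it is more informative when positives are rare.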


Source journal

Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20

Self-citation rate: 13.70%

Articles published: 549

Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.