DrugPred: An ensemble learning model based on ESM2 for predicting potential druggable proteins

IF 6.2, Zone 2 (Computer Science), JCR Q1 (COMPUTER SCIENCE, THEORY & METHODS)

Future Generation Computer Systems-The International Journal of Escience, Pub Date: 2025-03-15, DOI: 10.1016/j.future.2025.107801

Hong-Qi Zhang, Shang-Hua Liu, Jun-Wen Yu, Rui Li, Dong-Xin Ye, Yan-Ting Jin, Cheng-Bing Huang, Ke-Jun Deng

Volume 170, Article 107801. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0167739X25000962

Citations: 0

Abstract

The Human Genome Project has long since generated abundant data, but translating these data into practical drugs or drug targets remains challenging. This study proposes an ensemble learning model, DrugPred, that predicts drug targets using Evolutionary Scale Modeling (ESM2) embeddings and amino acid composition (AAC) as features. ESM2 uses deep learning to capture the sequence-structure-function relationship of proteins and extracts highly abstract protein features. AAC converts a protein sequence into the percentage of each amino acid, reflecting the protein's composition. Integrating the two feature types yields a multidimensional, diverse feature space that enables the model to perform well in predicting drug targets. The fused features are fed into four machine learning algorithms, each trained separately, and the resulting prediction probabilities are passed to a support vector machine for the voting decision. This ensemble overcomes the bias of any single algorithm and improves the stability and accuracy of the model. In a comprehensive evaluation, the model achieved an accuracy of 0.9691 and an area under the receiver operating characteristic curve (AUC) of 0.9868. We also used t-distributed Stochastic Neighbor Embedding (t-SNE) and SHapley Additive exPlanations (SHAP) to explore the interpretability of DrugPred. This study provides a fresh perspective and method for identifying drug targets and offers robust support for future drug development. A web server based on the DrugPred model is publicly accessible at http://drugpred.lin-group.cn/.
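As an illustration of the AAC feature and the feature fusion mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that AAC is the fraction of each of the 20 standard amino acids in the sequence and that fusion is plain concatenation; the abstract states neither detail explicitly. The names `aac`, `fuse`, and `toy_embedding` are hypothetical, and the three-number embedding is only a stand-in for a real ESM2 vector.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> List[float]:
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Each entry is the fraction of residues equal to one standard amino acid.
    Non-standard symbols (e.g. X, B, Z) are ignored, which is an assumption
    of this sketch rather than something stated in the abstract.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def fuse(embedding: List[float], composition: List[float]) -> List[float]:
    """Fuse two feature vectors by simple concatenation (assumed here)."""
    return list(embedding) + list(composition)


# Hypothetical stand-in for an ESM2 embedding; a real one has far more dimensions.
toy_embedding = [0.12, -0.40, 0.33]
features = fuse(toy_embedding, aac("MKTAYIAKQR"))
print(len(features))  # 3 embedding values + 20 composition values
```

In a pipeline of the kind the abstract describes, a fused vector like `features` would be computed per protein and passed to the base classifiers, whose predicted probabilities then become the input of the support vector machine.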




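Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 19.90

Self-citation rate: 2.70%

Articles published: 376

Review time: 10.6 months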

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.