DrugPred: An ensemble learning model based on ESM2 for predicting potential druggable proteins
Hong-Qi Zhang, Shang-Hua Liu, Jun-Wen Yu, Rui Li, Dong-Xin Ye, Yan-Ting Jin, Cheng-Bing Huang, Ke-Jun Deng
Future Generation Computer Systems, volume 170, Article 107801. Published 2025-03-15.
DOI: 10.1016/j.future.2025.107801
https://www.sciencedirect.com/science/article/pii/S0167739X25000962
Citation count: 0
Abstract
The Human Genome Project has long provided abundant data, but transforming this data into practical drugs or drug targets remains challenging. This study proposed an ensemble learning model, called DrugPred, that predicts drug targets using evolutionary scale modeling (ESM2) and amino acid composition (AAC) as features. ESM2 used deep learning to model the sequence-structure-function relationship of proteins, extracting highly abstract features from their sequences. AAC converted each protein sequence into the percentage of each amino acid it contains, reflecting the protein's composition. Integrating the two feature types produced a multidimensional and diverse feature space, enabling the model to perform well in predicting drug targets. We input the fused features into four machine learning algorithms, trained each one separately to generate prediction probabilities, and then input those probabilities into a support vector machine for the final voting decision. This ensemble learning overcame the bias of any single algorithm and improved the stability and accuracy of the model. After comprehensive evaluation, the model achieved an accuracy of 0.9691 and an area under the receiver operating characteristic curve (AUC) of 0.9868. We also used t-distributed Stochastic Neighbor Embedding (t-SNE) and SHapley Additive exPlanations (SHAP) to explore the interpretability of the DrugPred model. This study provided a fresh perspective and method for identifying drug targets, offering robust support for future drug development. We have developed a publicly accessible web server based on the DrugPred model, available at http://drugpred.lin-group.cn/.
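As a rough illustration of the AAC feature described in the abstract, the following minimal Python sketch computes the fraction of each of the 20 standard amino acids in a sequence and concatenates the result with another feature vector. The function names `aac` and `fuse`, the alphabetical residue order, the handling of non-standard residues, and fusion by concatenation are all assumptions made for this sketch; the abstract does not specify how DrugPred implements these steps.

```python
from collections import Counter

# The 20 standard amino acids in a fixed (alphabetical) order, so that
# every protein maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard residues (e.g. 'X') are ignored. This is an assumption
    of the sketch, not something the abstract specifies.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def fuse(embedding: list[float], composition: list[float]) -> list[float]:
    """Fuse two feature vectors by plain concatenation (assumed here)."""
    return list(embedding) + list(composition)


# Toy sequence: one M, two K, one A.
composition = aac("MKKA")

# Hypothetical stand-in for a protein embedding; a real ESM2 embedding
# would be much longer and come from the pretrained model.
toy_embedding = [0.1, 0.2]
features = fuse(toy_embedding, composition)
```

Multiplying each fraction by 100 gives the percentages mentioned in the abstract; the fused vector would then serve as input to the base classifiers.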
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.