Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning.

Impact factor 3.6; CAS journal ranking: Zone 2, Biology

PLoS Computational Biology Pub Date : 2023-08-17 eCollection Date: 2023-08-01 DOI:10.1371/journal.pcbi.1010974

Li Xie, Lei Xie

PLoS Computational Biology 19(8): e1010974. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10464998/pdf/

Citations: 0

Abstract

Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules that induce the degradation of target proteins by recruiting an E3 ligase. PROTACs have the potential to inactivate disease-related genes that are considered undruggable by small molecules, making them a promising therapy for the treatment of incurable diseases. However, only a few hundred proteins have been experimentally tested for their amenability to PROTACs, and it remains unclear which other proteins in the entire human genome can be targeted by PROTACs. In this study, we have developed PrePROTAC, an interpretable machine learning model based on a transformer-based protein sequence descriptor and random forest classification. PrePROTAC predicts genome-wide targets that can be degraded by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, an average precision of 0.84, and over 40% sensitivity at a false positive rate of 0.05. When evaluated by an external test set which comprised proteins from different structural folds than those in the training set, the performance of PrePROTAC did not drop significantly, indicating its generalizability. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method, which extends conventional SHAP analysis for original features to an embedding space through in silico mutagenesis. This method allowed us to identify key residues in the protein structure that play critical roles in PROTAC activity. The identified key residues were consistent with existing knowledge. Using PrePROTAC, we identified over 600 novel understudied proteins that are potentially degradable by CRBN and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
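The benchmark figures quoted in the abstract (ROC-AUC, and sensitivity at a false positive rate of 0.05) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names `roc_auc` and `sensitivity_at_fpr`, and the toy labels and scores, are assumptions made up for illustration, and no PrePROTAC data or results are reproduced here.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_at_fpr(y_true, scores, max_fpr=0.05):
    """Highest true positive rate reachable while keeping FPR <= max_fpr."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sweep the decision threshold from the highest score downward.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / negatives > max_fpr:
            break  # FPR only grows from here on
        best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Hypothetical toy data: 4 "degradable" proteins (label 1) and 20 others (label 0).
toy_labels = [1, 1, 1, 1] + [0] * 20
toy_scores = [0.9, 0.8, 0.4, 0.3] + [0.85, 0.5] + [0.1] * 18

print(roc_auc(toy_labels, toy_scores))
print(sensitivity_at_fpr(toy_labels, toy_scores, max_fpr=0.05))
```

With 20 negatives, a false positive rate of 0.05 allows exactly one false positive, so the second metric reports the fraction of positives ranked above the second-highest-scoring negative. Reporting sensitivity at a low fixed FPR is useful for genome-wide screening, where even a small false positive rate produces many spurious candidates.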




Source journal

PLoS Computational Biology (category: Biology, Biochemical Research Methods)

CiteScore: 7.10

Self-citation rate: 4.70%

Articles published: 820

Journal description: PLOS Computational Biology features works of exceptional significance that further our understanding of living systems at all scales, from molecules and cells to patient populations and ecosystems, through the application of computational methods. Readers include life and computational scientists, who can take the important findings presented here to the next level of discovery. Research articles must be declared as belonging to a relevant section. More information about the sections can be found in the submission guidelines. Research articles should model aspects of biological systems, demonstrate both methodological and scientific novelty, and provide profound new biological insights. Generally, reliability and significance of biological discovery through computation should be validated and enriched by experimental studies. Inclusion of experimental validation is not required for publication, but should be referenced where possible. Inclusion of experimental validation of a modest biological discovery through computation does not render a manuscript suitable for PLOS Computational Biology. Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise, to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities.