Prediction of Protein Half-lives from Amino Acid Sequences by Protein Language Models

bioRxiv - Bioinformatics | Pub Date: 2024-09-14 | DOI: 10.1101/2024.09.10.612367

Tatsuya Sagawa, Eisuke Kanao, Kosuke Ogata, Koshi Imami, Yasushi Ishihama


Abstract

We developed a protein half-life prediction model, PLTNUM, based on a protein language model using an extensive dataset of protein sequences and protein half-lives from the NIH3T3 mouse embryo fibroblast cell line as a training set. PLTNUM achieved an accuracy of 71% on validation data and showed robust performance with an area under the ROC curve (ROC AUC) of 0.73 when applied to a human cell line dataset. By incorporating Shapley Additive Explanations (SHAP) into PLTNUM, we identified key factors contributing to shorter protein half-lives, such as cysteine-containing domains and intrinsically disordered regions. Using SHAP values, PLTNUM can also predict potential degron sequences that shorten protein half-lives. This model provides a platform for elucidating the sequence dependency of protein half-lives, while the uncertainty in predictions underscores the importance of biological context in influencing protein half-lives.
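As a reading aid for the two metrics quoted above, the following is a minimal sketch of how accuracy and the area under the ROC curve are typically computed for a binary classifier. It is not the authors' evaluation code. The label convention (1 = shorter half-life), the 0.5 decision threshold, the function names, and the toy data are all assumptions made for illustration.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = shorter half-life, 0 = longer half-life.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.1]

print("accuracy:", accuracy(labels, scores))
print("ROC AUC:", roc_auc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas ROC AUC only depends on how the scores rank the examples. That is one reason both numbers are commonly reported together.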


