Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction.

Bioinformatics (Oxford, England). Pub Date: 2024-11-22. DOI: 10.1093/bioinformatics/btae621

Céline Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine



Abstract


Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction.

Motivation: Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model.
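To make the size of the prediction task concrete: a missense variant replaces one residue with a different amino acid, so a protein of length L has 19 × L possible single substitutions. The sketch below enumerates them for a toy sequence. The function name `single_substitutions`, the `M1A`-style variant labels, and the example sequence are illustrative assumptions, not part of VespaG.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """Yield every single amino acid variant of `sequence`.

    Variants are labelled as wild-type residue, 1-based position,
    mutant residue (for example 'M1A'). Hypothetical helper for
    illustration only.
    """
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


variants = list(single_substitutions("MKT"))
print(len(variants))   # 57, i.e. 19 substitutions at each of 3 positions
print(variants[:3])    # ['M1A', 'M1C', 'M1D']
```

A variant effect predictor assigns one score to each of these labels, which is why exhaustive experimental annotation scales so poorly with proteome size.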

Results: To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome, applying the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect, MAVE, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48±0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting all mutational landscapes of all proteins in proteomes such as Homo sapiens or Drosophila melanogaster in under 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
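The headline metric is a Spearman correlation computed per assay and then averaged. A minimal sketch of that kind of evaluation follows, using only the Python standard library. Two assumptions to flag: the abstract does not say what the ± denotes, so the standard error of the mean is used here as a stand-in, and the function names and toy scores are hypothetical rather than taken from the VespaG or ProteinGym code.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def mean_spearman(assays):
    """Average per-assay Spearman correlation.

    `assays` is a list of (predicted_scores, measured_scores) pairs.
    Returns (mean, standard error); the standard error is an assumed
    reading of the reported uncertainty.
    """
    rhos = [spearman(p, m) for p, m in assays]
    sem = stdev(rhos) / sqrt(len(rhos)) if len(rhos) > 1 else 0.0
    return mean(rhos), sem


# Toy data: two made-up assays with four variants each.
toy_assays = [
    ([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 4.0]),  # same ordering
    ([0.1, 0.4, 0.2, 0.9], [4.0, 2.0, 3.0, 1.0]),  # reversed ordering
]
print(mean_spearman(toy_assays))
```

Because Spearman correlation depends only on rank order, predictors whose raw scores live on different scales can be compared directly against the same experimental measurements.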

Availability: VespaG is available freely at https://github.com/jschlensok/vespag. The associated training data and predictions are available at https://doi.org/10.5281/zenodo.11085958.

Supplementary information: Supplementary data are available at Bioinformatics online.
