Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction.

Céline Marquet, Julius Schlensok, Marina Abakarova, Burkhard Rost, Elodie Laine
Bioinformatics (Oxford, England). Published 2024-11-22. DOI: 10.1093/bioinformatics/btae621 (https://doi.org/10.1093/bioinformatics/btae621)

Abstract

Motivation: Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model.
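To make the idea of "pLM embeddings as input to a minimal deep learning model" concrete, here is a minimal sketch of the data flow only. Everything in it is an assumption made for illustration: the single hidden layer, the toy dimensions, the random weights, and the names `init_layer`, `dense`, and `predict_landscape` are hypothetical and are not taken from the VespaG code base. It shows how one embedding vector per residue could be mapped to 20 substitution scores, giving an L x 20 mutational landscape for a protein of length L.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu=False):
    """Apply a dense layer to one input vector, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def predict_landscape(embeddings, hidden, output):
    """Map per-residue embeddings (L x d) to substitution scores (L x 20)."""
    return [dense(dense(e, hidden, relu=True), output) for e in embeddings]


# Toy sizes (assumption): real pLM embeddings have far more dimensions.
rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, PROTEIN_LENGTH = 8, 4, 5
hidden = init_layer(EMBED_DIM, HIDDEN_DIM, rng)
output = init_layer(HIDDEN_DIM, len(AMINO_ACIDS), rng)

# Stand-in for embeddings that a pLM would produce, one vector per residue.
toy_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(PROTEIN_LENGTH)
]
scores = predict_landscape(toy_embeddings, hidden, output)
```

In this sketch the only expensive step in practice would be computing the embeddings; the model on top is a handful of small matrix products per residue, which is consistent with the speed argument made in the abstract.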

Results: To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from the human proteome, using the multiple sequence alignment-based effect predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect, MAVEs, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48±0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting all mutational landscapes of all proteins in proteomes such as Homo sapiens or Drosophila melanogaster in under 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
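The benchmark figure above is a mean of per-assay Spearman correlations. The following sketch shows one way such a summary could be computed. It is an illustration under assumptions: the abstract does not say how the ± value was derived, so the standard error of the mean used here is a guess, and the function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Rank values (1-based), giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(predicted, measured):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def mean_spearman(assays):
    """assays maps a name to (predicted, measured); returns (mean, std. error)."""
    rhos = [spearman(pred, meas) for pred, meas in assays.values()]
    return mean(rhos), stdev(rhos) / sqrt(len(rhos))


# Toy data (made up): two tiny "assays" with predicted and measured scores.
toy_assays = {
    "assay_a": ([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]),
    "assay_b": ([0.9, 0.2, 0.5], [3.0, 1.0, 2.0]),
}
avg_rho, std_err = mean_spearman(toy_assays)
```

Because Spearman correlation compares ranks rather than raw values, a predictor is rewarded for ordering variants correctly within each assay, even when its scores are on a different scale from the experimental measurements.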

Availability: VespaG is available freely at https://github.com/jschlensok/vespag. The associated training data and predictions are available at https://doi.org/10.5281/zenodo.11085958.

Supplementary information: Supplementary data are available at Bioinformatics online.
