Peptide Sequencing Via Protein Language Models

arXiv - QuanBio - Biomolecules | Pub Date: 2024-08-01 | DOI: arxiv-2408.00892

Thuong Le Hoai Pham, Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl, Mohammad Sadegh Nasr, Cody Tyler Reynolds, Jai Prakash Yadav Veerla, Helen H Shang, Justyn Jaworski, Alison Ravenscraft, Joseph Anthony Buonomo, Jacob M. Luber

Citations: 0

Abstract

We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing has relied on mass spectrometry, with some novel Edman degradation-based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and fine-tune a ProtBert-derived, transformer-based model for a new downstream task, predicting these masked residues, which provides an approximation of the complete sequence. Evaluating on three bacterial Escherichia species, we achieve per-amino-acid accuracy of up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessment using AlphaFold and TM-score validates the biological relevance of our predictions. The model also demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, potentially accelerating advancements in proteomics and structural biology by providing a probabilistic reconstruction of the complete protein sequence from limited experimental data.
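The targeted masking and the per-amino-acid accuracy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `mask_sequence` and `per_residue_accuracy`, the `[MASK]` token, and the example sequences are assumptions made for illustration. The abstract also does not state whether accuracy is computed over masked positions only or over all positions; this sketch assumes masked positions only.

```python
from typing import Iterable, List

# The four residues the abstract treats as experimentally measurable.
KNOWN_RESIDUES = frozenset("KCYM")
# Assumption: a BERT-style mask token; the token used in the paper may differ.
MASK_TOKEN = "[MASK]"


def mask_sequence(sequence: str, known: Iterable[str] = KNOWN_RESIDUES) -> List[str]:
    """Keep residues listed in `known`; replace every other residue with MASK_TOKEN."""
    known_set = set(known)
    return [aa if aa in known_set else MASK_TOKEN for aa in sequence.upper()]


def per_residue_accuracy(true_seq: str, predicted_seq: str, masked: List[str]) -> float:
    """Fraction of masked positions where the predicted residue equals the true one."""
    if not (len(true_seq) == len(predicted_seq) == len(masked)):
        raise ValueError("sequences must have equal length")
    positions = [i for i, token in enumerate(masked) if token == MASK_TOKEN]
    if not positions:
        return 1.0  # nothing was hidden, so nothing can be wrong
    correct = sum(true_seq[i] == predicted_seq[i] for i in positions)
    return correct / len(positions)


if __name__ == "__main__":
    true_seq = "MKTAYIAKQC"          # hypothetical peptide
    masked = mask_sequence(true_seq)
    print(" ".join(masked))          # space-separated tokens, one per residue
    predicted = "MKSAYIAKQC"         # hypothetical model output with one wrong residue
    print(per_residue_accuracy(true_seq, predicted, masked))
```

In this toy case five of the ten residues are hidden and one of the five is predicted incorrectly, so the score is 0.8. A real pipeline would pass the masked tokens to the fine-tuned model and take the predicted residues from its output rather than from a hand-written string.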
