Thuong Le Hoai Pham, Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl, Mohammad Sadegh Nasr, Cody Tyler Reynolds, Jai Prakash Yadav Veerla, Helen H Shang, Justyn Jaworski, Alison Ravenscraft, Joseph Anthony Buonomo, Jacob M. Luber
Peptide Sequencing Via Protein Language Models
arXiv - QuanBio - Biomolecules, 2024-08-01, doi: arxiv-2408.00892
We introduce a protein language model for determining the complete sequence
of a peptide based on measurement of a limited set of amino acids. To date,
protein sequencing relies on mass spectrometry, with some novel Edman
degradation-based platforms able to sequence non-native peptides. Current
protein sequencing techniques face limitations in accurately identifying all
amino acids, hindering comprehensive proteome analysis. Our method simulates
partial sequencing data by selectively masking amino acids that are
experimentally difficult to identify in protein sequences from the UniRef
database. This targeted masking mimics real-world sequencing limitations. We
then modify and fine-tune a ProtBert-derived transformer model for the new
downstream task of predicting these masked residues, providing an approximation of
the complete sequence. Evaluating on three bacterial Escherichia species, we
achieve per-amino-acid accuracy up to 90.5% when only four amino acids ([KCYM])
are known. Structural assessment using AlphaFold and TM-score validates the
biological relevance of our predictions. The model also demonstrates potential
for evolutionary analysis through cross-species performance. This integration
of simulated experimental constraints with computational predictions offers a
promising avenue for enhancing protein sequence analysis, potentially
accelerating advancements in proteomics and structural biology by providing a
probabilistic reconstruction of the complete protein sequence from limited
experimental data.
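
The targeted masking and per-amino-acid scoring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mask symbol "X", the function names, and the choice to score only the masked positions are assumptions made for illustration. The only detail taken from the abstract is the set of known residues, [KCYM].

```python
# Minimal sketch of targeted masking and per-residue scoring.
# Assumptions (not from the paper): the mask symbol "X", the function
# names, and scoring restricted to masked positions only.

KNOWN = frozenset("KCYM")  # residues assumed to be experimentally identified
MASK = "X"                 # hypothetical placeholder for an unidentified residue


def mask_sequence(seq, known=KNOWN, mask=MASK):
    """Keep residues in `known`; replace every other residue with `mask`."""
    return "".join(aa if aa in known else mask for aa in seq)


def masked_accuracy(true_seq, pred_seq, known=KNOWN):
    """Fraction of masked positions (residues not in `known`) predicted correctly."""
    if len(true_seq) != len(pred_seq):
        raise ValueError("sequences must have the same length")
    pairs = [(t, p) for t, p in zip(true_seq, pred_seq) if t not in known]
    if not pairs:
        return 1.0  # nothing was masked, so there is nothing to get wrong
    return sum(t == p for t, p in pairs) / len(pairs)


if __name__ == "__main__":
    true_seq = "MKTAYIAKQR"
    pred_seq = "MKTAYLAKQR"  # hypothetical model output with one wrong residue
    print(mask_sequence(true_seq))              # MKXXYXXKXX
    print(masked_accuracy(true_seq, pred_seq))  # 5 of 6 masked residues correct
```

In this toy case six residues are masked and one is mispredicted, so the score is 5/6. Whether the reported 90.5% is computed over masked positions only or over the full sequence is not stated in the abstract, so the metric here should be read as one plausible definition.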