{"title":"Revealing deep proteome diversity with community-scale proteomics big data","authors":"N. Bandeira","doi":"10.1145/3156346.3156694","DOIUrl":null,"url":null,"abstract":"Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics events requires the development of novel algorithms and community-scale computational workflows. MassIVE (http://massive.ucsd.edu) proposes to address this challenge in three stages. First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open source workflows with detailed records of search parameters and of individual Peptide Spectrum Matches (PSMs). As such, our large-scale reanalysis of tens of terabytes of human data has now increased the total number of proper public PSMs by over 10-fold to over 320 million PSMs whose coverage includes over 95 Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled 2+ million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, all of which at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LCMS runs and 19,610 total proteins when including the recent ProteomeTools data. Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly-correlated spectra. 
These show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, thus suggesting that current coverage of proteome diversity (at 1.3 variants per protein region) is far below what is observable in experimental data.","PeriodicalId":415207,"journal":{"name":"Proceedings of the 8th International Conference on Computational Systems-Biology and Bioinformatics","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th International Conference on Computational Systems-Biology and Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3156346.3156694","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics events requires novel algorithms and community-scale computational workflows. MassIVE (http://massive.ucsd.edu) proposes to address this challenge in three stages. First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open-source workflows with detailed records of search parameters and of individual peptide-spectrum matches (PSMs). Accordingly, our large-scale reanalysis of tens of terabytes of human data has now increased the total number of proper public PSMs by over 10-fold to over 320 million PSMs, whose coverage includes over 95 [...] Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled 2+ million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, all of which at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LC-MS runs and 19,610 total proteins when including the recent ProteomeTools data. Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly correlated spectra. These results show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, suggesting that current coverage of proteome diversity (at 1.3 variants per protein region) is far below what is observable in experimental data.