Revealing deep proteome diversity with community-scale proteomics big data

Proceedings of the 8th International Conference on Computational Systems-Biology and Bioinformatics. Pub Date: 2017-12-07. DOI: 10.1145/3156346.3156694

N. Bandeira


Abstract

Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics events requires the development of novel algorithms and community-scale computational workflows. MassIVE (http://massive.ucsd.edu) proposes to address this challenge in three stages. First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open-source workflows with detailed records of search parameters and of individual Peptide Spectrum Matches (PSMs). As such, our large-scale reanalysis of tens of terabytes of human data has now increased the total number of proper public PSMs by over 10-fold to over 320 million PSMs whose coverage includes over 95 [...]. Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled 2+ million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, all of which at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LC-MS runs and 19,610 total proteins when including the recent ProteomeTools data. Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly correlated spectra. These show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, thus suggesting that current coverage of proteome diversity (at 1.3 variants per protein region) is far below what is observable in experimental data.
