Revealing deep proteome diversity with community-scale proteomics big data

N. Bandeira
DOI: 10.1145/3156346.3156694 (https://doi.org/10.1145/3156346.3156694)
Published in: Proceedings of the 8th International Conference on Computational Systems-Biology and Bioinformatics
Publication date: 2017-12-07

Abstract

Translating the growing volumes of proteomics mass spectrometry data into reusable evidence of the occurrence and provenance of proteomics events requires the development of novel algorithms and community-scale computational workflows. MassIVE (http://massive.ucsd.edu) proposes to address this challenge in three stages. First, systematic annotation of human proteomics big data requires automated reanalysis of all public data using open source workflows with detailed records of search parameters and of individual Peptide Spectrum Matches (PSMs). As such, our large-scale reanalysis of tens of terabytes of human data has now increased the total number of proper public PSMs by over 10-fold to over 320 million PSMs whose coverage includes over 95 […]. Second, proper synthesis of community-scale search results into a reusable knowledge base (KB) requires scalable workflows imposing strict statistical controls. Our MassIVE-KB spectral library has thus properly assembled more than 2 million precursors from over 1.5 million peptides covering over 6.2 million amino acids in the human proteome, all of which at least double the numbers covered by the popular NIST spectral libraries. Moreover, MassIVE-KB detects 723 novel proteins (PE 2-5) for a total of 16,852 proteins observed in non-synthetic LC-MS runs and 19,610 total proteins when including the recent ProteomeTools data. Third, we show how advanced identification algorithms combine with public data to reveal dozens of unexpected putative modifications supported by multiple highly correlated spectra. These show that protein regions can be observed in over 100 different variants with various combinations of post-translational modifications and cleavage events, thus suggesting that current coverage of proteome diversity (at 1.3 variants per protein region) is far below what is observable in experimental data.
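
The abstract does not say how correlation between spectra is measured. As a rough illustration only, the sketch below assumes a binned cosine similarity, a common way to compare two MS/MS spectra; the function names, the `(m/z, intensity)` peak representation, and the 1.0 m/z bin width are all hypothetical choices made for this example and are not taken from MassIVE.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks is a list of (mz, intensity) pairs; bin_width is an assumed value.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra.

    Returns a value in [0, 1] for non-negative intensities; spectra with the
    same relative peak pattern score close to 1, spectra sharing no bins score 0.
    """
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as uncorrelated
    return dot / (norm_a * norm_b)
```

Under this assumed measure, a putative modification would count as supported by "multiple highly correlated spectra" when several independent spectra assigned to it score near 1 against one another; the threshold for "highly" would be a further assumption.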