Cloud-Enabled Scalable Analysis of Large Proteomics Cohorts.

IF 3.8, CAS Zone 2 (Biology), JCR Q1, Biochemical Research Methods

Journal of Proteome Research Pub Date : 2025-03-07 Epub Date: 2025-02-13 DOI:10.1021/acs.jproteome.4c00771

Harendra Guturu, Andrew Nichols, Lee S Cantrell, Seth Just, Janos Kis, Theodore Platt, Iman Mohtashemi, Jian Wang, Serafim Batzoglou

Pages: 1462-1469. Publication type: Journal Article.

Citations: 0

Abstract

Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies enable large-scale cohort proteomic and proteogenomic analyses. As such, the data infrastructure and search engines required to process the data must also scale. This challenge is amplified in search engines that rely on library-free match-between-runs (MBR) search, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search could scale to process cohorts of thousands or more individuals. Here, we present a strategy to deploy search engines in a distributed cloud environment without source code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared with the days required for the original DIA-NN MBR procedure, and that the results are almost indistinguishable from those of DIA-NN native MBR. We additionally show that empirical spectra generated by Scalable MBR better approximate DIA-NN native MBR than semiempirical alternatives such as ID-RT-IM MBR, preserving the user's choice to use empirical libraries in large cohort analysis. The method has been tested to scale to over 15,000 injections and is available for use in the Proteograph Analysis Suite.
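The abstract does not describe how Scalable MBR is implemented, so the following is only a toy sketch of the general two-pass, scatter-gather pattern that an MBR-style search suggests: search each run independently, gather the confident identifications into a shared empirical library, then re-search every run against that library in parallel. Everything here is an assumption made for illustration: the data model, the function names (`first_pass`, `build_library`, `second_pass`, `scalable_two_pass`), the q-value cutoffs, and the retention-time tolerance are hypothetical and do not reflect the internals of DIA-NN or Scalable MBR.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical toy data model: one run maps precursor id -> (q_value, retention_time).
Run = Dict[str, Tuple[float, float]]
# Hypothetical empirical library: precursor id -> reference retention time.
Library = Dict[str, float]


def first_pass(run: Run, q_cutoff: float = 0.01) -> Run:
    """Stand-in for an independent per-run search: keep confident identifications."""
    return {p: (q, rt) for p, (q, rt) in run.items() if q <= q_cutoff}


def build_library(first_pass_results: List[Run]) -> Library:
    """Gather step: for each precursor, keep the RT from its best-scoring run."""
    best: Dict[str, Tuple[float, float]] = {}
    for result in first_pass_results:
        for precursor, (q, rt) in result.items():
            if precursor not in best or q < best[precursor][0]:
                best[precursor] = (q, rt)
    return {p: rt for p, (_, rt) in best.items()}


def second_pass(run: Run, library: Library,
                rt_tol: float = 0.5, relaxed_q: float = 0.05) -> List[str]:
    """Re-search one run against the shared library (toy matching rule, an assumption)."""
    return sorted(
        p for p, (q, rt) in run.items()
        if p in library and q <= relaxed_q and abs(rt - library[p]) <= rt_tol
    )


def scalable_two_pass(runs: List[Run], workers: int = 4) -> List[List[str]]:
    """Scatter the per-run passes across workers; only the library build is serial."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        first = list(pool.map(first_pass, runs))
        library = build_library(first)
        return list(pool.map(lambda r: second_pass(r, library), runs))


if __name__ == "__main__":
    runs = [
        {"A": (0.001, 10.0), "B": (0.03, 20.1)},
        {"A": (0.02, 10.2), "B": (0.005, 20.0)},
    ]
    print(scalable_two_pass(runs))
```

In this sketch the two per-run passes are independent of one another, so they could in principle be dispatched to separate cloud workers, while the library-building step is the only point that needs results from every run. A thread pool stands in for a distributed scheduler purely to keep the example self-contained.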


Source journal: Journal of Proteome Research (Biology: Biochemical Research Methods)

CiteScore: 9.00
Self-citation rate: 4.50%
Articles published: 251
Review time: 3 months

About the journal: Journal of Proteome Research publishes content encompassing all aspects of global protein analysis and function, including the dynamic aspects of genomics, spatio-temporal proteomics, metabonomics and metabolomics, clinical and agricultural proteomics, as well as advances in methodology including bioinformatics. The theme and emphasis is on a multidisciplinary approach to the life sciences through the synergy between the different types of "omics".