Cloud-Enabled Scalable Analysis of Large Proteomics Cohorts
Harendra Guturu, Andrew Nichols, Lee S Cantrell, Seth Just, Janos Kis, Theodore Platt, Iman Mohtashemi, Jian Wang, Serafim Batzoglou
Journal of Proteome Research. Published 2025-02-13. DOI: 10.1021/acs.jproteome.4c00771 (https://doi.org/10.1021/acs.jproteome.4c00771)
Abstract
Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies enable large-scale cohort proteomic and proteogenomic analyses. As such, the data infrastructure and search engines required to process the data must also scale. This challenge is amplified in search engines that rely on library-free match-between-runs (MBR) search, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search could scale to process cohorts of thousands or more individuals. Here, we present a strategy to deploy search engines in a distributed cloud environment without source code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared to the days required for the original DIA-NN MBR procedure, and that the results are almost indistinguishable from those of DIA-NN native MBR. We additionally show that empirical spectra generated by Scalable MBR better approximate DIA-NN native MBR than semiempirical alternatives such as ID-RT-IM MBR, preserving the user's choice to use empirical libraries in large cohort analysis. The method has been tested to scale to over 15,000 injections and is available for use in the Proteograph Analysis Suite.
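The abstract does not describe the internals of Scalable MBR, so the following is only a toy sketch of the general two-pass shape that match-between-runs searches take: each run is searched independently, the confident hits are merged into a shared empirical library, and each run is then re-searched against that library. Everything in it is an assumption made for illustration: the function names, the score thresholds, the retention-time tolerance, the dictionary data model, and the use of a local thread pool as a stand-in for distributed cloud workers. It is not the authors' algorithm and not DIA-NN's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data model: a run maps precursor id -> (score, retention time).
STRICT_SCORE = 0.9   # assumed first-pass acceptance threshold
RELAXED_SCORE = 0.5  # assumed second-pass threshold for library-backed hits
RT_TOLERANCE = 1.0   # assumed retention-time window, arbitrary units


def first_pass(run):
    """Search one run on its own, keeping only confident identifications."""
    return {p: (s, rt) for p, (s, rt) in run.items() if s >= STRICT_SCORE}


def build_library(first_pass_results):
    """Merge per-run hits, keeping the best-scoring observation per precursor."""
    library = {}
    for hits in first_pass_results:
        for precursor, (score, rt) in hits.items():
            if precursor not in library or score > library[precursor][0]:
                library[precursor] = (score, rt)
    return library


def second_pass(run, library):
    """Re-search one run against the shared library with relaxed criteria."""
    return {
        p
        for p, (s, rt) in run.items()
        if p in library
        and s >= RELAXED_SCORE
        and abs(rt - library[p][1]) <= RT_TOLERANCE
    }


def two_pass_mbr(runs, workers=4):
    """Both passes are independent per run, so they parallelize trivially;
    only the library merge in between needs all first-pass results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        first = list(pool.map(first_pass, runs))
        library = build_library(first)
        second = list(pool.map(lambda r: second_pass(r, library), runs))
    return library, second


runs = [
    {"A": (0.95, 10.0), "B": (0.60, 20.2), "C": (0.60, 30.0)},
    {"A": (0.70, 10.3), "B": (0.92, 20.0), "C": (0.60, 30.1)},
]
library, identified = two_pass_mbr(runs)
```

In this toy input, precursor A is confident only in the first run and B only in the second, yet both are recovered in both runs after the second pass; C never passes the strict threshold anywhere, so it never enters the library. The structure also shows where a scaling bottleneck can arise: the merge step is the single point that depends on every run.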
About the journal:
Journal of Proteome Research publishes content encompassing all aspects of global protein analysis and function, including the dynamic aspects of genomics, spatio-temporal proteomics, metabonomics and metabolomics, clinical and agricultural proteomics, as well as advances in methodology including bioinformatics. The theme and emphasis is on a multidisciplinary approach to the life sciences through the synergy between the different types of "omics".