MultiStageSearch: An Iterative Workflow for Unbiased Taxonomic Analysis of Pathogens Using Proteogenomics
Julian Pipart, Tanja Holstein, Lennart Martens, Thilo Muth
Journal of Proteome Research, published 2025-05-18. DOI: 10.1021/acs.jproteome.4c00901 (https://doi.org/10.1021/acs.jproteome.4c00901)
Abstract
The global SARS-CoV-2 pandemic emphasized the need for accurate pathogen diagnostics. While genomics is the gold standard, integrating mass spectrometry-based proteomics offers additional benefits. However, current proteomic and genomic reference databases are often biased toward specific taxa, such as pathogenic strains or model organisms, and proteomic databases are less comprehensive. These biases and gaps can lead to inaccurate identifications. To address these issues, we introduce MultiStageSearch, a multistep database search method that combines proteome and genome databases for taxonomic analysis. Initially, a generalist proteome database is used to infer potential species. Then, MultiStageSearch generates a specialized proteogenomic database for precise identification. This database is preprocessed to filter duplicates and cluster identical open reading frames to reduce genomic database biases. The workflow operates independently of strain-level NCBI taxonomy, enabling the identification of strains not represented in existing taxonomies. We benchmarked the workflow on viral and bacterial samples, demonstrating its superior performance in strain-level taxonomic inference compared to existing methods. MultiStageSearch offers a flexible and accurate approach for pathogen research and diagnostics, overcoming incomplete search spaces and biases inherent in reference databases.
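The preprocessing step mentioned in the abstract (filtering duplicates and clustering identical open reading frames) can be illustrated with a small Python sketch. This is not the MultiStageSearch implementation: the function names, the strain identifiers, and the toy sequences are all hypothetical, and the sketch assumes that "identical" means exact sequence equality after translation.

```python
from collections import defaultdict


def drop_duplicate_genomes(orfs_by_genome):
    """Keep one representative for genomes whose ORF content is identical.

    orfs_by_genome maps a genome identifier (hypothetical names here) to a
    list of translated ORF sequences.
    """
    representative = {}
    for genome in sorted(orfs_by_genome):
        key = frozenset(seq.upper() for seq in orfs_by_genome[genome])
        # The first genome (in sorted order) with this ORF set is kept.
        representative.setdefault(key, genome)
    return {g: orfs_by_genome[g] for g in representative.values()}


def cluster_identical_orfs(orfs_by_genome):
    """Map each unique ORF sequence to the sorted genomes that contain it."""
    clusters = defaultdict(set)
    for genome, orfs in orfs_by_genome.items():
        for seq in orfs:
            clusters[seq.upper()].add(genome)
    return {seq: sorted(genomes) for seq, genomes in clusters.items()}


# Toy input: strain_B repeats strain_A exactly (assumed example data).
example = {
    "strain_A": ["MKT", "MAAL"],
    "strain_B": ["MKT", "MAAL"],
    "strain_C": ["MKT", "MQQR"],
}

unique_genomes = drop_duplicate_genomes(example)
clusters = cluster_identical_orfs(unique_genomes)
```

In this sketch each unique sequence is stored once together with the genomes it came from, so an over-represented strain does not inflate the search space simply because it was sequenced many times.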
About the journal:
Journal of Proteome Research publishes content encompassing all aspects of global protein analysis and function, including the dynamic aspects of genomics, spatio-temporal proteomics, metabonomics and metabolomics, clinical and agricultural proteomics, as well as advances in methodology including bioinformatics. The theme and emphasis is on a multidisciplinary approach to the life sciences through the synergy between the different types of "omics".