MultiStageSearch: An Iterative Workflow for Unbiased Taxonomic Analysis of Pathogens Using Proteogenomics.

IF 3.8 | Region 2, Biology | JCR Q1, Biochemical Research Methods

Journal of Proteome Research. Pub Date: 2025-05-18. DOI: 10.1021/acs.jproteome.4c00901

Julian Pipart, Tanja Holstein, Lennart Martens, Thilo Muth



Abstract

The global SARS-CoV-2 pandemic emphasized the need for accurate pathogen diagnostics. While genomics is the gold standard, integrating mass spectrometry-based proteomics offers additional benefits. However, current proteomic and genomic reference databases are often biased toward specific taxa, such as pathogenic strains or model organisms, and proteomic databases are less comprehensive. These biases and gaps can lead to inaccurate identifications. To address these issues, we introduce MultiStageSearch, a multistep database search method that combines proteome and genome databases for taxonomic analysis. Initially, a generalist proteome database is used to infer potential species. Then, MultiStageSearch generates a specialized proteogenomic database for precise identification. This database is preprocessed to filter duplicates and cluster identical open reading frames to reduce genomic database biases. The workflow operates independently of strain-level NCBI taxonomy, enabling the identification of strains not represented in existing taxonomies. We benchmarked the workflow on viral and bacterial samples, demonstrating its superior performance in strain-level taxonomic inference compared to existing methods. MultiStageSearch offers a flexible and accurate approach for pathogen research and diagnostics, overcoming incomplete search spaces and biases inherent in reference databases.
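The preprocessing step described above (filtering duplicates and clustering identical open reading frames) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the input format of `(orf_id, sequence)` pairs, the function names `cluster_identical_orfs` and `representatives`, and the example identifiers are all hypothetical, and "identical" is taken here to mean an exact sequence match after case normalization.

```python
from collections import defaultdict


def cluster_identical_orfs(records):
    """Group ORF records that share exactly the same sequence.

    records: iterable of (orf_id, sequence) pairs (hypothetical input format).
    Returns a dict mapping each distinct sequence to the sorted list of
    ORF ids that carry it.
    """
    clusters = defaultdict(set)
    for orf_id, sequence in records:
        # Normalize case and surrounding whitespace so trivially
        # different copies of the same sequence fall into one cluster.
        key = sequence.strip().upper()
        # Using a set drops exact duplicate records automatically.
        clusters[key].add(orf_id)
    return {seq: sorted(ids) for seq, ids in clusters.items()}


def representatives(clusters):
    """Pick one representative id per cluster (lexicographically first)."""
    return {ids[0]: seq for seq, ids in clusters.items()}


# Toy example: the same ORF appears in two genomes, plus one duplicate record.
records = [
    ("strainA_orf1", "MKTAYIAK"),
    ("strainB_orf7", "MKTAYIAK"),  # identical ORF from another genome
    ("strainA_orf1", "MKTAYIAK"),  # exact duplicate record
    ("strainC_orf2", "MSTNPKPQ"),
]
clusters = cluster_identical_orfs(records)
reps = representatives(clusters)
```

Collapsing identical sequences into one cluster means that a heavily sequenced strain contributes each distinct ORF once rather than once per deposited genome, which is one plausible way to reduce the database bias the abstract mentions.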


Source journal

Journal of Proteome Research (Biology: Biochemical Research Methods)

CiteScore: 9.00

Self-citation rate: 4.50%

Articles published: 251

Review time: 3 months

About the journal: Journal of Proteome Research publishes content encompassing all aspects of global protein analysis and function, including the dynamic aspects of genomics, spatio-temporal proteomics, metabonomics and metabolomics, clinical and agricultural proteomics, as well as advances in methodology including bioinformatics. The theme and emphasis are on a multidisciplinary approach to the life sciences through the synergy between the different types of "omics".