Validation and Community Sharing of Ocean Spectral Libraries Generated by Machine Learning for Data Independent Acquisition Ocean Metaproteomic Analyses

IF 3.9 · CAS Zone 4 (Biology) · JCR Q2, Biochemical Research Methods

Proteomics, Volume 25, Issue 13 · Pub Date: 2025-06-11 · DOI: 10.1002/pmic.13971

Margaret Mars Brisbin, Matthew R. McIlvin, Damien Beau Wilburn, Jaclyn K. Saunders, Natalie R. Cohen, Maya Bhatia, Elizabeth Kujawinski, Brian C. Searle, Mak A. Saito

Abstract

Ocean metaproteomics provides valuable insights into the structure and function of marine microbial communities. Yet ocean samples are challenging because their extensive biological diversity yields a very large number of peptides spanning a large dynamic range. This study characterized the capabilities of data-independent acquisition (DIA) mode for ocean metaproteomic samples. Spectral libraries were constructed from discovered peptides and proteins using machine learning (ML) algorithms to prevent the incorporation of false positives into the libraries. When compared with 1-dimensional and 2-dimensional data-dependent acquisition (DDA) analyses, DIA outperformed DDA both with and without gas-phase fractionation. We found that larger spectral libraries of discovered proteins performed better, regardless of the geographic distance between where samples were collected for library generation and where the test samples were collected. Moreover, the spectral library containing all unique proteins present in the Ocean Protein Portal (OPP) outperformed smaller libraries generated from individual sampling campaigns. However, a spectral library constructed from all open reading frames (ORFs) in a metagenome was too large to be workable: it yielded few peptide identifications because a low false discovery rate is difficult to maintain with such a large database. Given sufficient sequencing depth and validation studies, spectral libraries generated from previously discovered proteins can serve as a community resource, saving resequencing efforts. The spectral libraries generated in this study are available at the OPP to enable future ocean proteomic studies.

Source journal

Proteomics (Biology: Biochemical Research Methods)

CiteScore: 6.30

Self-citation rate: 5.90%

Articles published: 193

Review time: 3 months

Journal description: PROTEOMICS is the premier international source for information on all aspects of applications and technologies, including software, in proteomics and other "omics". The journal includes but is not limited to proteomics, genomics, transcriptomics, metabolomics and lipidomics, and systems biology approaches. Papers describing novel applications of proteomics and integration of multi-omics data and approaches are especially welcome.