Validation and Community Sharing of Ocean Spectral Libraries Generated by Machine Learning for Data Independent Acquisition Ocean Metaproteomic Analyses.
Margaret Mars Brisbin, Matthew R McIlvin, Damien Beau Wilburn, Jaclyn K Saunders, Natalie R Cohen, Maya Bhatia, Elizabeth Kujawinski, Brian C Searle, Mak A Saito
Proteomics, article e13971. Published 2025-06-11. DOI: 10.1002/pmic.13971 (https://doi.org/10.1002/pmic.13971). Impact factor: 3.4. JCR: Q2, Biochemical Research Methods.
Citations: 0
Abstract
Ocean metaproteomics provides valuable insights into the structure and function of marine microbial communities. Yet ocean samples are challenging to analyze because their extensive biological diversity yields a very large number of peptides spanning a wide dynamic range. This study characterized the capabilities of data-independent acquisition (DIA) for ocean metaproteomic samples. Spectral libraries were constructed from discovered peptides and proteins using machine learning (ML) algorithms to avoid incorporating false positives into the libraries. When compared with one-dimensional and two-dimensional data-dependent acquisition (DDA) analyses, DIA outperformed DDA both with and without gas-phase fractionation. Larger spectral libraries of discovered proteins performed better, regardless of the geographic distance between the sites where the library samples and the test samples were collected. Moreover, the spectral library containing all unique proteins in the Ocean Protein Portal (OPP) outperformed smaller libraries generated from individual sampling campaigns. However, a spectral library constructed from all open reading frames (ORFs) in a metagenome proved too large to be workable: maintaining a low false discovery rate across such a large database resulted in few peptide identifications. Given sufficient sequencing depth and validation studies, spectral libraries generated from previously discovered proteins can serve as a community resource, saving resequencing efforts. The spectral libraries generated in this study are available at the OPP to enable future ocean proteomic studies.
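The abstract's point about database size and false discovery rate can be illustrated with a toy target-decoy calculation. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `accepted_at_fdr`, the scores, and the 20% threshold are all hypothetical, and the FDR is estimated with the simple decoys/targets ratio. It shows that when a larger search space produces more high-scoring decoy matches, fewer target matches survive the same FDR cutoff.

```python
def accepted_at_fdr(scored_matches, max_fdr):
    """Count target matches accepted at the most permissive score cutoff
    whose estimated FDR (decoys / targets above the cutoff) is <= max_fdr.

    scored_matches: iterable of (score, is_decoy) pairs (hypothetical data).
    """
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    targets = decoys = best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the largest target count seen while the FDR estimate holds.
        if targets and decoys / targets <= max_fdr:
            best = targets
    return best


# The same five true peptide matches in both searches (illustrative scores).
true_hits = [(9.0, False), (8.0, False), (7.0, False), (6.0, False), (5.0, False)]

# Small library: few chances for a random high-scoring decoy match.
small_library = true_hits + [(2.0, True)]

# Very large library: many more candidates, so decoys reach high scores.
large_library = true_hits + [(8.5, True), (7.5, True), (6.5, True), (5.5, True)]

print(accepted_at_fdr(small_library, max_fdr=0.2))  # 5
print(accepted_at_fdr(large_library, max_fdr=0.2))  # 1
```

In the toy data the true matches are identical in both searches; only the number of competing decoys changes, which is enough to push most true matches below the acceptance threshold.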
About the journal:
PROTEOMICS is the premier international source for information on all aspects of applications and technologies, including software, in proteomics and other "omics". The journal includes but is not limited to proteomics, genomics, transcriptomics, metabolomics and lipidomics, and systems biology approaches. Papers describing novel applications of proteomics and integration of multi-omics data and approaches are especially welcome.