CoSpred: Machine Learning Workflow to Predict Tandem Mass Spectrum in Proteomics
Liang Xue, Shivani Tiwary, Mykola Bordyuh, Robert Stanton
Proteomics, e70004 (published 2025-06-30). Journal Article. DOI: 10.1002/pmic.70004, https://doi.org/10.1002/pmic.70004
Abstract
In mass spectrometry-based proteomics, deep learning algorithms can help improve peptide and protein identification rates by generating high-fidelity theoretical spectra. These spectra can serve as the basis of a spectral library that is more complete than those presently available, especially for unobserved protein/genetic variants. Here we focus on providing an end-to-end, user-friendly machine learning workflow, which we call Complete Spectrum Predictor (CoSpred). Using CoSpred, users can create their own machine-learning-compatible training dataset and then train a machine learning model to predict both backbone and non-backbone ions. The model uses a transformer encoder architecture to predict the complete MS/MS spectrum from a given peptide sequence. In addition to the transformer model provided in the package, the code is built modularly so that alternate ML models can be easily "plugged in," allowing spectrum prediction to be optimized for different experimental conditions. The CoSpred workflow (preprocessing→training→inference) provides a path for state-of-the-art ML capabilities to be more accessible to proteomics scientists.
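The preprocessing→training→inference flow and the "plug-in" model idea described in the abstract can be illustrated with a minimal sketch. This is not CoSpred's actual code or API: every name here (`encode_peptide`, `bin_spectrum`, `SpectrumModel`, `MeanSpectrumBaseline`, `run_workflow`), the 1 m/z bin width, and the 2000 m/z upper bound are assumptions chosen for illustration. A trivial mean-spectrum baseline stands in for the transformer encoder so that the example runs with the standard library alone.

```python
from typing import List, Protocol, Sequence, Tuple

# Assumed vocabulary: the 20 standard amino acids; 0 is reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

Peak = Tuple[float, float]  # (m/z, intensity)


def encode_peptide(seq: str, max_len: int = 30) -> List[int]:
    """Preprocessing: map a peptide sequence to a fixed-length list of token ids."""
    if len(seq) > max_len:
        raise ValueError("peptide longer than max_len")
    try:
        ids = [TOKEN[aa] for aa in seq]
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc}") from None
    return ids + [0] * (max_len - len(ids))


def bin_spectrum(peaks: Sequence[Peak], max_mz: float = 2000.0,
                 bin_width: float = 1.0) -> List[float]:
    """Preprocessing: turn a peak list into a fixed-length intensity vector.

    Binning the whole m/z range (rather than only annotated b/y ions) is one
    way to represent a "complete" spectrum, including non-backbone ions.
    """
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] = max(vec[idx], intensity)  # keep the strongest peak per bin
    top = max(vec)
    return [v / top for v in vec] if top > 0 else vec  # scale to base peak = 1


class SpectrumModel(Protocol):
    """Hypothetical plug-in interface: any model with fit/predict can be swapped in."""

    def fit(self, X: List[List[int]], Y: List[List[float]]) -> None: ...

    def predict(self, X: List[List[int]]) -> List[List[float]]: ...


class MeanSpectrumBaseline:
    """Placeholder model (not a transformer): predicts the mean training spectrum."""

    def fit(self, X: List[List[int]], Y: List[List[float]]) -> None:
        n = len(Y)
        self.mean = [sum(col) / n for col in zip(*Y)]

    def predict(self, X: List[List[int]]) -> List[List[float]]:
        return [list(self.mean) for _ in X]


def run_workflow(model: SpectrumModel,
                 train: Sequence[Tuple[str, Sequence[Peak]]],
                 queries: Sequence[str]) -> List[List[float]]:
    """Preprocess the training pairs, train the model, then run inference."""
    X = [encode_peptide(seq) for seq, _ in train]
    Y = [bin_spectrum(peaks) for _, peaks in train]
    model.fit(X, Y)
    return model.predict([encode_peptide(q) for q in queries])
```

Under these assumptions, swapping in a different predictor means passing any object that implements `fit` and `predict` to `run_workflow`; the preprocessing and inference steps stay unchanged.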
About the journal:
PROTEOMICS is the premier international source for information on all aspects of applications and technologies, including software, in proteomics and other "omics". The journal includes but is not limited to proteomics, genomics, transcriptomics, metabolomics and lipidomics, and systems biology approaches. Papers describing novel applications of proteomics and integration of multi-omics data and approaches are especially welcome.