Yehor Horokhovskyi, Hanna P Roetschke, John A Cormican, Martin Pašen, Sina Garazhian, Michele Mishto, Juliane Liepe
Title: An automated workflow to address proteome complexity and the large search space problem in proteomics and HLA-I immunopeptidomics
Journal: Molecular & Cellular Proteomics, 101039 (published 2025-07-21; impact factor 5.5; JCR Q1, Biochemical Research Methods)
DOI: https://doi.org/10.1016/j.mcpro.2025.101039
Abstract
Antigenic noncanonical epitope discovery and novel protein discovery are research areas with therapeutic applications, predominantly pursued via mass spectrometry. Mass spectrometry searches should rely on a well-characterized proteogenomic search space. The size of this search space is barely known for antigenic noncanonical peptides and novel proteins, which could affect their identification. To address these issues, we here develop an automated workflow comprising Sequoia, for the creation of RNA sequencing-informed and exhaustive sequence search spaces for various noncanonical peptide origins, and SPIsnake, for pre-filtering and exploration of the sequence search space prior to mass spectrometry searches. We apply our workflow to characterize the exact sizes of tryptic and nonspecific peptide sequence search spaces under a variety of definitions, their reduction when using RNA expression, their inflation by post-translational modifications, and the frequency of peptide sequence multimapping to different noncanonical origins. Furthermore, we explore the application of Sequoia and SPIsnake to HLA-I immunopeptidomes, thereby rescuing sensitivity in peptide identification when confronted with inflated search spaces. Taken together, Sequoia and SPIsnake pave the way for the informed development of methods addressing large-scale exhaustive proteogenomic discovery by exposing the consequences of database size inflation and the ambiguity of peptide and protein sequence identification.
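To make the notion of tryptic versus nonspecific search space concrete, the following is a minimal sketch and not the Sequoia or SPIsnake implementation. It counts unique candidate peptides for a single toy sequence. The sequence, the function names, the 8 to 12 residue length window, and the simplified trypsin rule (cleave after K or R unless the next residue is P) are all assumptions chosen for illustration.

```python
# Toy comparison of tryptic and nonspecific peptide search spaces.
# All names and parameters here are hypothetical and illustrative only.

def tryptic_fragments(protein):
    """Split after K or R, except when the next residue is P (simplified rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def tryptic_peptides(protein, missed_cleavages=0):
    """Unique peptides formed by joining up to missed_cleavages + 1 adjacent fragments."""
    fragments = tryptic_fragments(protein)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides


def nonspecific_peptides(protein, min_len=8, max_len=12):
    """Unique substrings of every length in [min_len, max_len] (no enzyme rule)."""
    peptides = set()
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            peptides.add(protein[start:start + length])
    return peptides


toy_protein = "MKTAYIAKQRPLSVDGHK"  # made-up sequence, not from the paper
tryptic = tryptic_peptides(toy_protein, missed_cleavages=1)
nonspecific = nonspecific_peptides(toy_protein)
print(len(tryptic), len(nonspecific))
```

Even for this short toy sequence, the nonspecific enumeration yields many more candidates than the tryptic one, which illustrates why nonspecific searches, as used for HLA-I immunopeptidomes, face a much larger search space.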
About the journal:
The mission of MCP is to foster the development and applications of proteomics in both basic and translational research. MCP will publish manuscripts that report significant new biological or clinical discoveries underpinned by proteomic observations across all kingdoms of life. Manuscripts must define the biological roles played by the proteins investigated or their mechanisms of action.
The journal also emphasizes articles that describe innovative new computational methods and technological advancements that will enable future discoveries. Manuscripts describing such approaches do not have to include a solution to a biological problem, but must demonstrate that the technology works as described, is reproducible and is appropriate to uncover yet unknown protein/proteome function or properties using relevant model systems or publicly available data.
Scope:
-Fundamental studies in biology, including integrative "omics" studies, that provide mechanistic insights
-Novel experimental and computational technologies
-Proteogenomic data integration and analysis that enable greater understanding of physiology and disease processes
-Pathway and network analyses of signaling that focus on the roles of post-translational modifications
-Studies of proteome dynamics and quality controls, and their roles in disease
-Studies of evolutionary processes affecting proteome dynamics, quality and regulation
-Chemical proteomics, including mechanisms of drug action
-Proteomics of the immune system and antigen presentation/recognition
-Microbiome proteomics, host-microbe and host-pathogen interactions, and their roles in health and disease
-Clinical and translational studies of human diseases
-Metabolomics to understand functional connections between genes, proteins and phenotypes