ShortStop: a machine learning framework for microprotein discovery.

BMC methods. Pub Date: 2025-01-01. Epub Date: 2025-08-01. DOI: 10.1186/s44330-025-00037-4

Brendan Miller, Eduardo Vieira de Souza, Victor J Pai, Hosung Kim, Joan M Vaughan, Calvin J Lau, Jolene K Diedrich, Alan Saghatelian




ShortStop: a machine learning framework for microprotein discovery.

Background: The human genome contains over 3 million small open reading frames (smORFs, ≤ 150 codons). Ribosome profiling and proteogenomics have transformed our understanding of these sequences by showing that thousands are actively translated and that hundreds produce peptides detectable by mass spectrometry. However, the random arrangement of codons across the 3-gigabase human genome generates smORFs by chance, suggesting that many may represent translational noise or regulatory elements rather than functional proteins. This view is supported by the fact that most translating smORFs are upstream open reading frames (uORFs), which typically regulate translation of canonical coding sequences rather than encode bioactive microproteins. As interest grows in uncovering biologically meaningful microproteins, a key challenge remains: distinguishing functional smORFs from non-functional or regulatory translation products. Empirical methods such as individual microprotein studies or large-scale screens can help, but they are time-consuming, expensive, and technically limited. New complementary strategies are needed.
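To make the smORF definition above concrete, the following is a minimal sketch of a forward-strand ORF scan with the ≤ 150 codon cutoff. It is an illustration only, not part of ShortStop: the function name `find_smorfs`, the restriction to ATG start codons, the forward-strand-only scan, and the choice to count codons without the stop codon are all assumptions made here.

```python
# Hypothetical illustration: scan the three forward frames of a DNA string
# for ATG-initiated ORFs and keep those at or below a codon cutoff.
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_CODONS = 150  # cutoff from the abstract; stop codon not counted (assumption)


def find_smorfs(dna, max_codons=MAX_CODONS):
    """Return sorted (start, end, n_codons) tuples; end is exclusive and includes the stop."""
    dna = dna.upper()
    hits = []
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons <= max_codons:
                    hits.append((start, i + 3, n_codons))
                start = None  # close the ORF and look for the next ATG
    return sorted(hits)


example_hits = find_smorfs("CCATGAAATTTTAGGG")
```

Even this toy scan reports an ORF in an arbitrary string, which is the point the Background makes: short ORFs arise by chance, so finding one says little about function.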

Methods: To address this challenge, we developed ShortStop, a computational framework based on the idea that not all translating smORFs produce functional proteins, but the ones that do may resemble experimentally characterized microproteins. ShortStop classifies smORFs into two reference groups: Swiss-Prot Analog Microproteins (SAMs), which resemble known microproteins, and PRISMs (Physicochemically Resembling In Silico Microproteins), which are synthetic sequences designed to match the composition of translating smORFs but which lack sequence order or evolutionary selection and therefore serve as a proxy for non-functional peptides. This two-class system enables machine learning to help prioritize smORFs for downstream study.
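The idea of a composition-matched negative class can be sketched as follows. This is not the authors' PRISM generation procedure, which the abstract does not describe; it is one simple stand-in (per-sequence residue shuffling) that keeps amino acid composition and length while destroying sequence order. The function name `make_decoy`, the decision to keep the first residue in place, and the example sequences are assumptions for illustration.

```python
import random
from collections import Counter


def make_decoy(protein, rng):
    """Shuffle all residues after the first, preserving composition and length.

    Keeping the first residue (typically the initiator Met) in place is an
    assumption of this sketch, not a documented ShortStop rule.
    """
    head, tail = protein[:1], list(protein[1:])
    rng.shuffle(tail)
    return head + "".join(tail)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed = ["MKTLLVAGRR", "MSDEELLKKW"]  # made-up example sequences
decoys = [make_decoy(seq, rng) for seq in observed]

# Composition is unchanged by construction; only residue order differs.
composition_preserved = all(
    Counter(d) == Counter(s) for d, s in zip(decoys, observed)
)
```

A classifier trained on real microproteins against such decoys cannot separate the classes on composition alone, so it has to pick up on features that depend on residue order.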

Results: ShortStop achieved high precision (90-94%), recall (87-96%), and F1 scores (90-93%) across all classes. When applied to a published dataset of translating smORFs, ShortStop classified about 8% as SAMs, that is, candidates with biochemical properties resembling Swiss-Prot microproteins. The remaining 92% resembled in silico-generated sequences (PRISMs) and may represent noncanonical proteins, non-functional peptides, or regulatory translation events. SAMs showed lower C-terminal hydrophobicity (linked to reduced proteasomal degradation) and greater N-terminal hydrophilicity at neutral pH, suggesting improved solubility and intracellular stability. ShortStop also identified microproteins overlooked by other methods, including one encoded by an upstream overlapping smORF in the StAR gene, which was detectable in human cells and steroid-producing tissues. In a clinical lung cancer dataset, ShortStop uncovered differentially expressed microprotein candidates, several of which were validated by mass spectrometry.
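For readers unfamiliar with the per-class metrics quoted above, the sketch below computes precision, recall, and F1 for one class from paired true and predicted labels. The function name `per_class_metrics` and the toy labels are hypothetical; the numbers it produces are unrelated to the reported ShortStop results.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
y_true = ["SAM", "SAM", "PRISM", "PRISM", "PRISM"]
y_pred = ["SAM", "PRISM", "PRISM", "PRISM", "SAM"]
sam_metrics = per_class_metrics(y_true, y_pred, "SAM")
prism_metrics = per_class_metrics(y_true, y_pred, "PRISM")
```

Reporting the metrics per class matters here because the classes are imbalanced in real data (about 8% SAMs versus 92% PRISMs in the published dataset), so overall accuracy alone could hide poor performance on the minority class.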

Discussion: ShortStop addresses a key gap in microprotein research: the lack of scalable tools to characterize microproteins and of standardized negative training data for machine learning models of microproteins. By providing a classification framework rooted in biochemical features, ShortStop offers a practical solution for targeting smORFs in functional studies, benchmarking new discovery tools, and advancing microprotein research.
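As one example of the kind of biochemical feature such a framework could use, the sketch below scores the N- and C-terminal windows of a peptide with the Kyte-Doolittle hydropathy scale (higher values are more hydrophobic). The abstract does not state which scale, window size, or feature set ShortStop uses, so the scale choice, the 10-residue default window, and the function names are assumptions made here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean Kyte-Doolittle score of a non-empty sequence of standard residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


def terminal_hydropathy(protein, window=10):
    """Return (N-terminal mean, C-terminal mean) over `window` residues each.

    For peptides shorter than the window, both values cover the whole peptide.
    """
    return mean_hydropathy(protein[:window]), mean_hydropathy(protein[-window:])


# Made-up peptide: lysine-rich start, leucine-rich end.
n_term, c_term = terminal_hydropathy("MKKKKLLLLL", window=5)
```

In this toy peptide the N-terminal window scores as hydrophilic (negative mean) and the C-terminal window as hydrophobic (positive mean), which is the opposite of the pattern the Results associate with SAMs.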

Supplementary information: The online version contains supplementary material available at 10.1186/s44330-025-00037-4.
