NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model.

IF 3.3 | Zone 3, Biology | Q2, BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics Pub Date : 2025-08-19 DOI:10.1186/s12859-025-06220-2

Line Sandvad Nielsen, Anders Gorm Pedersen, Ole Winther, Henrik Nielsen

BMC Bioinformatics 26(1):216. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366053/pdf/

Citations: 0

Abstract

Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5′ end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model.

Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions.

Conclusion: By leveraging "protein-ness", NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/
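The idea in the abstract, that sequence upstream of a true start codon should translate to a nonsensical peptide while sequence downstream should look like the beginning of a protein, can be illustrated with a minimal sketch. The code below is not the NetStart 2.0 implementation: the function names (`translate`, `candidate_views`), the window sizes (`flank`, `max_codons`), and the choice of reading the upstream region in frame with the candidate ATG are all assumptions made for illustration. It only prepares the two kinds of input the abstract mentions (a local nucleotide context and translated peptide sequences); feeding the peptides to a protein language model such as ESM-2 is left out.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt: str) -> str:
    """Translate nucleotides codon by codon; unknown codons become 'X'."""
    nt = nt.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3)
    )


def candidate_views(transcript: str, atg_pos: int,
                    flank: int = 10, max_codons: int = 100):
    """Return (local_context, upstream_peptide, downstream_peptide).

    flank and max_codons are hypothetical window sizes, not values
    taken from the NetStart 2.0 paper.
    """
    seq = transcript.upper().replace("U", "T")
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG at the given position")

    # Nucleotide window around the candidate start codon.
    local = seq[max(0, atg_pos - flank):atg_pos + 3 + flank]

    # Upstream region read in the same frame as the candidate ATG.
    # Both terms below are congruent to atg_pos modulo 3, so the frame is kept.
    up_start = max(atg_pos % 3, atg_pos - 3 * max_codons)
    upstream = translate(seq[up_start:atg_pos])

    # Downstream region starting at the ATG itself ('*' marks a stop codon).
    downstream = translate(seq[atg_pos:atg_pos + 3 * max_codons])
    return local, upstream, downstream


if __name__ == "__main__":
    # Toy transcript, invented for this example.
    print(candidate_views("GCCGCCACCATGGCTTCTAAAGGTTAA", 9, flank=4))
```

In a real pipeline each ATG in a transcript would be treated as a candidate, and a classifier would score it from these views; how NetStart 2.0 combines them is described in the paper rather than in this abstract.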




Source journal: BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

Journal description: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series, which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.