D-sORF: Accurate Ab Initio Classification of Experimentally Detected Small Open Reading Frames (sORFs) Associated with Translational Machinery

Biology, Pub Date: 2024-07-26, DOI: 10.3390/biology13080563
Nikos Perdikopanis, Antonis Giannakakis, Ioannis Kavakiotis, A. Hatzigeorgiou

Abstract

Small open reading frames (sORFs; <300 nucleotides or <100 amino acids) are widespread across all genomes, and a growing variety of them appear to be translated from non-genic regions. Over the past few decades, peptides produced from sORFs have been identified as functional in various organisms, from bacteria to humans. Despite recent advances in next-generation sequencing and proteomics, accurate annotation and classification of sORFs remain a rate-limiting step toward reliable, high-throughput detection of small proteins from non-genic regions. Computational methods based on machine learning cost less than biological experiments and can be used to detect sORFs, laying the groundwork for those experiments. We present D-sORF, a machine-learning framework that integrates statistical nucleotide context and motif information around the start codon to predict coding sORFs. D-sORF scores directly for coding identity and requires only the underlying genomic sequence; it does not incorporate parameters such as conservation, which, in the case of sORFs, may increase the dispersion of scores within the significantly less conserved non-genic regions. D-sORF achieves 94.74% precision and 92.37% accuracy for small ORFs (using the 99 nt medium-length window). When D-sORF is applied to sORFs associated with ribosomes, its identification of transcripts producing peptides (annotated by Ensembl IDs) is similar or superior to that of experimental methodologies based on ribosome-sequencing (Ribo-Seq) profiling. In parallel, the recognition of putative negative data, such as intron-containing transcripts that associate with ribosomes, remains remarkably low, indicating that D-sORF could be efficiently applied to filter out false-positive sORFs that arise in Ribo-Seq data from non-productive ribosomal binding or noise inherent in these protocols.
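To make the abstract's definitions concrete, the following Python sketch enumerates candidate sORFs in a nucleotide sequence and extracts a fixed-length window around each start codon. It is an illustration only and not the D-sORF implementation: the function names `find_sorfs` and `start_codon_window` are hypothetical, the scan assumes ATG starts and the standard stop codons on the forward strand, the length threshold is assumed to count the stop codon, and centring the 99 nt window on the start codon is an assumption, since the abstract does not say how the window is positioned.

```python
from typing import Iterator, NamedTuple, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
MAX_SORF_NT = 300                    # sORF threshold from the abstract: <300 nt


class SOrf(NamedTuple):
    start: int      # 0-based index of the A in ATG
    end: int        # exclusive index just past the stop codon
    sequence: str   # nucleotides from ATG through the stop codon


def find_sorfs(seq: str, max_nt: int = MAX_SORF_NT) -> Iterator[SOrf]:
    """Yield forward-strand ORFs (ATG ... in-frame stop) shorter than max_nt."""
    seq = seq.upper()
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start < max_nt:
                    yield SOrf(start, end, seq[start:end])
                break


def start_codon_window(seq: str, start: int, window: int = 99) -> Optional[str]:
    """Return `window` nt centred on the start codon at `start`.

    Returns None when the sequence is too short on either side.
    Centring the window on the start codon is an assumption of this sketch.
    """
    flank = (window - 3) // 2          # 48 nt on each side for window=99
    lo, hi = start - flank, start + 3 + flank
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi].upper()
```

Windows extracted this way could then be passed to any sequence classifier; the features and model that D-sORF actually uses are described in the paper, not here.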