Evaluating computational tools for protein-coding sequence detection: Are they up to the task?

Impact Factor: 4.2 | Zone 3 (Biology) | JCR Q2 | Biochemistry & Molecular Biology
Journal: RNA | Pub Date: 2025-06-11 | DOI: 10.1261/rna.080416.125
D J Champion, Ting-Hsuan Chen, Susan Thomson, Michael A Black, Paul P Gardner
Citations: 0

Abstract

Background: Detecting protein-coding genes in nucleotide sequences is a significant challenge for understanding genome and transcriptome function, yet the reliability of bioinformatic tools for this task remains largely unverified. This is despite some tools having been available for several decades and being widely used for genome and transcriptome annotation.

Results: We assess de novo protein-coding detection tools that operate on single nucleotide sequences or on alignments. The controls we use exclude any previous training dataset and include coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets. Our work demonstrates that several widely used tools are neither accurate nor computationally efficient for the protein-coding sequence detection problem. In fact, just three of nine tools significantly outperformed a naive scoring scheme. Furthermore, we note a high discrepancy between self-reported accuracies and the accuracy achieved in our study. Our results show that the extra dimension provided by conserved and variable nucleotides in alignments gives a significant advantage over single-sequence approaches.

Conclusions: These results highlight significant limitations in existing protein-coding annotation tools that are widely used for lncRNA annotation. This shows a need for more robust and efficient approaches to training and assessing the performance of tools for identifying protein-coding sequences. Our study paves the way for future advancements in comparative genomic approaches and, we hope, will popularise more robust approaches to genome and transcriptome annotation.
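
The benchmark design summarised in the Results (a positive set, length-matched shuffled negatives, and a naive baseline score) can be illustrated with a minimal sketch. The abstract does not say how sequences were shuffled, what the naive scoring scheme was, or which accuracy measure was used, so everything below is an assumption made for illustration: a mononucleotide shuffle, a hypothetical baseline that scores a sequence by its longest stop-free codon run in the three forward frames, and a rank-based AUC. The toy sequences are made up and are not data from the study.

```python
import random

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def shuffle_sequence(seq, rng):
    """Return a shuffled copy of seq with the same length and base composition.

    Assumption: a simple mononucleotide shuffle; the study may have used a
    different shuffling procedure.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def longest_stop_free_run(seq):
    """Hypothetical naive score: longest run of consecutive non-stop codons,
    measured in codons, over the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        # Step through complete codons only.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0
            else:
                run += 1
                best = max(best, run)
    return best


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffles are repeatable
    # Made-up ORF-like strings standing in for coding exons.
    positives = [
        "ATGGCTGCTAAAGGTGAACTGCGTGCTGAAGCTTAA",
        "ATGAAACTGGTTGGTGCTGCTCCGGAAGAACTGTGA",
    ]
    # Length-matched negatives: one shuffled copy per positive.
    negatives = [shuffle_sequence(s, rng) for s in positives]
    pos_scores = [longest_stop_free_run(s) for s in positives]
    neg_scores = [longest_stop_free_run(s) for s in negatives]
    print("AUC of the naive baseline:", auc(pos_scores, neg_scores))
```

Any detection tool that returns a per-sequence score could be passed through the same `auc` function and compared against the baseline, which is one way to read the claim that only three of nine tools beat a naive scheme.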

Source journal: RNA (Biology: Biochemistry and Molecular Biology)
CiteScore: 8.30
Self-citation rate: 2.20%
Articles published: 101
Review time: 2.6 months
Journal description: RNA is a monthly journal which provides rapid publication of significant original research in all areas of RNA structure and function in eukaryotic, prokaryotic, and viral systems. It covers a broad range of subjects in RNA research, including: structural analysis by biochemical or biophysical means; mRNA structure, function and biogenesis; alternative processing: cis-acting elements and trans-acting factors; ribosome structure and function; translational control; RNA catalysis; tRNA structure, function, biogenesis and identity; RNA editing; rRNA structure, function and biogenesis; RNA transport and localization; regulatory RNAs; large and small RNP structure, function and biogenesis; viral RNA metabolism; RNA stability and turnover; in vitro evolution; and RNA chemistry.