D J Champion, Ting-Hsuan Chen, Susan Thomson, Michael A Black, Paul P Gardner
RNA. Published 2025-06-11. DOI: 10.1261/rna.080416.125 (https://doi.org/10.1261/rna.080416.125)
Evaluating computational tools for protein-coding sequence detection: Are they up to the task?
Background: Detecting protein-coding genes in nucleotide sequences is a significant challenge for understanding genome and transcriptome function, yet the reliability of bioinformatic tools for this task remains largely unverified, even though some of these tools have been available for several decades and are widely used for genome and transcriptome annotation.
Results: We assess de novo protein-coding detection tools that operate on either single nucleotide sequences or alignments. Our controls exclude any previous training dataset and comprise coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets. Our work demonstrates that several widely used tools are neither accurate nor computationally efficient for the protein-coding sequence detection problem. In fact, just three of nine tools significantly outperformed a naive scoring scheme. Furthermore, we note a large discrepancy between self-reported accuracies and the accuracy achieved in our study. Our results show that the extra dimension of information from conserved and variable nucleotides in alignments gives alignment-based approaches a significant advantage over single-sequence approaches.
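The benchmark design described above (a positive set, length-matched negative sets, and a naive baseline score) can be illustrated with a minimal Python sketch. The abstract does not say how the shuffled controls were generated or what the naive scoring scheme was, so everything here is an assumption made for illustration: the mononucleotide shuffle, the "longest stop-free run" score, and the function names `shuffle_sequence`, `longest_stop_free_run`, and `auc` are all hypothetical and are not the authors' method.

```python
import random

# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def shuffle_sequence(seq, rng):
    """Return a shuffled copy of seq with the same length and base composition.

    Assumption: a simple mononucleotide shuffle; the study may have used
    a different shuffling procedure.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def longest_stop_free_run(seq):
    """Hypothetical naive score: the longest run of consecutive non-stop
    codons found in any of the three forward reading frames."""
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon ends the current run
            else:
                run += 1
                best = max(best, run)
    return best


def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a
    positive example outscores a negative one (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

In use, one would score each coding exon and each shuffled or intergenic control with the same function and pass the two score lists to `auc`; a tool that cannot clearly beat such a baseline on held-out data adds little over the naive score.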
Conclusions: These results highlight significant limitations in existing protein-coding annotation tools that are widely used for lncRNA annotation. They show a need for more robust and efficient approaches to training tools for identifying protein-coding sequences and to assessing their performance. Our study paves the way for future advances in comparative genomic approaches and, we hope, will popularise more robust approaches to genome and transcriptome annotation.
About the journal:
RNA is a monthly journal which provides rapid publication of significant original research in all areas of RNA structure and function in eukaryotic, prokaryotic, and viral systems. It covers a broad range of subjects in RNA research, including: structural analysis by biochemical or biophysical means; mRNA structure, function and biogenesis; alternative processing: cis-acting elements and trans-acting factors; ribosome structure and function; translational control; RNA catalysis; tRNA structure, function, biogenesis and identity; RNA editing; rRNA structure, function and biogenesis; RNA transport and localization; regulatory RNAs; large and small RNP structure, function and biogenesis; viral RNA metabolism; RNA stability and turnover; in vitro evolution; and RNA chemistry.