The state of the human coding gene catalogues.

IF 3.6; Zone 4 (Biology); Q1, Mathematical & Computational Biology

Database: The Journal of Biological Databases and Curation; Pub Date: 2025-01-18; DOI: 10.1093/database/baaf045

Miguel Maquedano, Daniel Cerdán-Vélez, Michael L Tress


Citations: 0

Abstract

In 2018, we analysed the three main repositories for the human proteome: Ensembl/GENCODE, RefSeq, and UniProtKB. At that time the three gene sets disagreed on the coding status of one of every eight annotated coding genes, and our results suggested that as many as 4234 of these genes might not be correctly classified. Here, we have repeated the analysis with updated versions of the three reference gene sets. Superficially, little appears to have changed. The three sets annotate 21 871 coding genes, slightly fewer than previously, and still disagree on the status of 2603 annotated genes, almost one in eight. However, we show that collaborations between the three reference gene sets have led to greater consensus. Reference catalogues have agreed on the coding status of another 249 genes since the last analysis while at least 700 genes have been reclassified. We still find that there are >2000 coding genes with at least one potential non-coding feature to indicate that they may not be coding genes. This includes a large majority of the 2603 genes for which annotators do not agree on coding status. In total, we believe that as many as 3000 genes may be misclassified as coding and could be annotated as non-coding genes, pseudogenes, or cancer antigens.
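The "almost one in eight" figure can be checked directly from the numbers quoted in the abstract. The sketch below does that arithmetic and then illustrates, with toy data, one plausible way to count disputed genes across three catalogues. It assumes that the 21 871 genes are the union of the three sets and that a gene is "disputed" when not all three catalogues annotate it as coding. The gene identifiers and set contents are hypothetical, and this is not presented as the authors' actual pipeline.

```python
# Figures quoted in the abstract.
total_coding_genes = 21_871
disputed_genes = 2_603

fraction_disputed = disputed_genes / total_coding_genes
one_in_n = total_coding_genes / disputed_genes

# Toy illustration only: hypothetical gene IDs, not real annotations.
ensembl_gencode = {"G1", "G2", "G3", "G4"}
refseq = {"G1", "G2", "G3", "G5"}
uniprotkb = {"G1", "G2", "G4", "G6"}

# Assumption: a gene counts as "annotated" if any catalogue calls it coding,
# and as "disputed" if at least one catalogue does not.
annotated_by_any = ensembl_gencode | refseq | uniprotkb
agreed_by_all = ensembl_gencode & refseq & uniprotkb
toy_disputed = annotated_by_any - agreed_by_all

print(f"{fraction_disputed:.1%} of annotated genes disputed, about 1 in {one_in_n:.1f}")
print(f"toy example: {len(toy_disputed)} of {len(annotated_by_any)} genes disputed")
```

With the abstract's figures, the disputed share is just under 12%, which is consistent with "almost one in eight" (12.5%).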




Source journal

Database: The Journal of Biological Databases and Curation (Mathematical & Computational Biology)

CiteScore: 9.00

Self-citation rate: 3.40%

Articles published: 100

Review time: >12 weeks

Journal description: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.