Fishing for a reelGene: evaluating gene models with evolution and machine learning

IF 5.7 | Zone 1, Biology | Q1, Plant Sciences

The Plant Journal Pub Date : 2025-09-22 DOI:10.1111/tpj.70483

Aimee J. Schulz, Jingjing Zhai, Taylor AuBuchon-Elder, Carson M. Andorf, Mohamed Z. El-Walid, Taylor H. Ferebee, Elizabeth H. Gilmore, Matthew B. Hufford, Lynn C. Johnson, Elizabeth A. Kellogg, Thuy La, Evan Long, Zachary R. Miller, John L. Portwood II, M. Cinta Romay, Arun S. Seetharam, Michelle C. Stitzer, Margaret R. Woodhouse, Travis Wrightsman, Edward S. Buckler, Brandon Monier, Sheng-Kai Hsu


Citations: 0

Abstract

Assembled genomes and their associated annotations have transformed our study of gene function. However, each new annotated assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses the conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million transcript models in Zea mays ssp. mays (maize), reelGene classified 28% as incorrectly annotated or non-functional. We find that reelGene classifies 92.2% of genes in the maize proteome and 99.2% of genes within the maize classical gene list as functional. reelGene also provides a way to further investigate genome biology: for instance, reelGene indicates that 10.3% of dispensable genes in B73 are functional, and within retained duplicate genes, reelGene identifies a 30% bias toward the retention of the M1 subgenome when one copy is functional and the other is non-functional. As an annotation-evaluating tool, reelGene is directly applicable to species of the Andropogoneae tribe, including other important crops like sorghum and miscanthus. As a community resource, reelGene has been integrated into MaizeGDB both as a browser track and as an individual Shiny App, allowing researchers to evaluate gene model accuracy and further investigate genome biology.
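
To give a sense of scale, the abstract's headline figures can be turned into approximate counts. The sketch below is only back-of-the-envelope arithmetic on the two rounded numbers quoted above (1.8 million transcript models, 28% flagged); the function name `split_counts` is hypothetical and is not part of reelGene.

```python
# Approximate counts implied by the abstract's headline figures.
# Both inputs are rounded values from the abstract, so the outputs are rough.
TOTAL_TRANSCRIPTS = 1_800_000   # "1.8 million transcript models"
FLAGGED_PERCENT = 28            # "28% as incorrectly annotated or non-functional"


def split_counts(total: int, flagged_percent: int) -> tuple[int, int]:
    """Return (flagged, retained) counts using integer arithmetic."""
    flagged = total * flagged_percent // 100
    return flagged, total - flagged


flagged, retained = split_counts(TOTAL_TRANSCRIPTS, FLAGGED_PERCENT)
print(f"flagged ~ {flagged:,}; retained ~ {retained:,}")
```

Under these rounded inputs, roughly half a million transcript models would be flagged as incorrectly annotated or non-functional, with about 1.3 million retained.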


Source journal

The Plant Journal (Biology: Plant Sciences)

CiteScore: 13.10
Self-citation rate: 4.20%
Articles published: 415
Review time: 2.3 months

Journal description: Publishing the best original research papers in all key areas of modern plant biology from the world's leading laboratories, The Plant Journal provides a dynamic forum for this ever-growing international research community. Plant science research is now at the forefront of research in the biological sciences, with breakthroughs in our understanding of fundamental processes in plants matching those in other organisms. The impact of molecular genetics and the availability of model and crop species can be seen in all aspects of plant biology. For publication in The Plant Journal, the research must provide a highly significant new contribution to our understanding of plants and be of general interest to the plant science community.