Aimee J. Schulz, Jingjing Zhai, Taylor AuBuchon-Elder, Carson M. Andorf, Mohamed Z. El-Walid, Taylor H. Ferebee, Elizabeth H. Gilmore, Matthew B. Hufford, Lynn C. Johnson, Elizabeth A. Kellogg, Thuy La, Evan Long, Zachary R. Miller, John L. Portwood II, M. Cinta Romay, Arun S. Seetharam, Michelle C. Stitzer, Margaret R. Woodhouse, Travis Wrightsman, Edward S. Buckler, Brandon Monier, Sheng-Kai Hsu
The Plant Journal, 123(6). Published 2025-09-22. DOI: 10.1111/tpj.70483. https://onlinelibrary.wiley.com/doi/10.1111/tpj.70483
Fishing for a reelGene: evaluating gene models with evolution and machine learning
Assembled genomes and their associated annotations have transformed our study of gene function. However, each new annotated assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses the conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million transcript models in Zea mays ssp. mays (maize), reelGene classified 28% as incorrectly annotated or non-functional. We find that reelGene classifies 92.2% of genes in the maize proteome and 99.2% of genes within the maize classical gene list as functional. reelGene also provides a way to further investigate genome biology. For instance, reelGene indicates that 10.3% of dispensable genes in B73 are functional, and within retained duplicate genes, reelGene identifies a 30% bias toward the retention of the M1 subgenome when one copy is functional and the other is non-functional. As an annotation-evaluating tool, reelGene is directly applicable to species of the Andropogoneae tribe, including other important crops like sorghum and miscanthus. As a community resource, reelGene has been integrated into MaizeGDB both as a browser track and as an individual Shiny App, allowing researchers to evaluate gene model accuracy and further investigate genome biology.
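To make the three-part evaluation concrete, the following is a minimal sketch of how per-transcript scores from three models might be combined into a single call. It is not reelGene's actual implementation: the abstract does not say how the three model outputs are combined, so the "all three must pass" rule, the 0.5 threshold, the class and function names, and the transcript IDs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the abstract does not report any threshold.
THRESHOLD = 0.5


@dataclass(frozen=True)
class TranscriptScores:
    """Scores in [0, 1] from three hypothetical per-transcript models."""
    transcript_id: str
    boundary: float  # (1) transcription boundaries
    mrna: float      # (2) mRNA integrity
    protein: float   # (3) protein structure


def classify(scores: TranscriptScores, threshold: float = THRESHOLD) -> str:
    """Call a transcript 'functional' only if all three scores reach the threshold.

    The conjunction rule is an assumption made for this sketch.
    """
    checks = (scores.boundary, scores.mrna, scores.protein)
    return "functional" if all(s >= threshold for s in checks) else "flagged"


def flagged_fraction(batch: list[TranscriptScores], threshold: float = THRESHOLD) -> float:
    """Fraction of transcripts in the batch that fail at least one check."""
    if not batch:
        return 0.0
    flagged = sum(classify(s, threshold) == "flagged" for s in batch)
    return flagged / len(batch)


# Made-up scores for four made-up transcripts.
batch = [
    TranscriptScores("tx_a", 0.9, 0.8, 0.7),
    TranscriptScores("tx_b", 0.9, 0.2, 0.7),  # fails the mRNA integrity check
    TranscriptScores("tx_c", 0.6, 0.6, 0.6),
    TranscriptScores("tx_d", 0.5, 0.95, 0.5),
]

for item in batch:
    print(item.transcript_id, classify(item))
print("flagged fraction:", flagged_fraction(batch))
```

Under this assumed rule, a transcript with strong boundary and protein scores is still flagged if its mRNA integrity score is low, which mirrors the abstract's framing that a gene model can be called incorrectly annotated or non-functional for more than one reason.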
About the journal:
Publishing the best original research papers in all key areas of modern plant biology from the world's leading laboratories, The Plant Journal provides a dynamic forum for this ever-growing international research community.
Plant science research is now at the forefront of research in the biological sciences, with breakthroughs in our understanding of fundamental processes in plants matching those in other organisms. The impact of molecular genetics and the availability of model and crop species can be seen in all aspects of plant biology. For publication in The Plant Journal the research must provide a highly significant new contribution to our understanding of plants and be of general interest to the plant science community.