Gene structure prediction using an orthologous gene of known exon-intron structure.

Applied Bioinformatics, vol. 3, no. 2-3, pp. 81-90. Pub Date: 2004-01-01. DOI: 10.2165/00822942-200403020-00002

Stephanie Seneff, Chao Wang, Christopher B Burge


Cited by: 7

Abstract


Given the availability of complete genome sequences from related organisms, sequence conservation can provide important clues for predicting gene structure. In particular, one should be able to leverage information about known genes in one species to help determine the structures of related genes in another. Such an approach is appealing in that high-quality gene prediction can be achieved for newly sequenced species, such as mouse and puffer fish, using the extensive knowledge that has been accumulated about human genes. This article reports a novel approach to predicting the exon-intron structures of mouse genes by incorporating constraints from orthologous human genes using techniques that have previously been exploited in speech and natural language processing applications. The approach uses a context-free grammar to parse a training corpus of annotated human genes. A statistical training procedure produces a weighted recursive transition network (RTN) intended to capture the general features of a mammalian gene. This RTN is expanded into a finite state transducer (FST) and composed with an FST capturing the specific features of the human orthologue. This model includes a trigram language model on the amino acid sequence as well as exon length constraints. A final stage uses the free software package ClustalW to align the top n candidates in the search space. For a set of 98 orthologous human-mouse pairs, we achieved 96% sensitivity and 97% specificity at the exon level on the mouse genes, given only knowledge gleaned from the annotated human genome.
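The exon-level figures quoted above can be illustrated with a small sketch. The abstract does not define its metrics, so this assumes the convention common in gene-finding evaluation: a predicted exon counts as correct only when both of its boundaries match an annotated exon, sensitivity is the fraction of annotated exons recovered, and "specificity" is the fraction of predicted exons that are correct. The function name and the coordinates are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates; illustrative convention only


def exon_level_accuracy(
    annotated: Iterable[Exon], predicted: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    Assumption: an exon is correct only if both boundaries match exactly.
    """
    annotated_set: Set[Exon] = set(annotated)
    predicted_set: Set[Exon] = set(predicted)
    exact = len(annotated_set & predicted_set)  # exons with both boundaries right
    sensitivity = exact / len(annotated_set) if annotated_set else 0.0
    specificity = exact / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Made-up coordinates: one exon has a shifted start, one prediction is spurious.
annotated = [(100, 250), (400, 520), (800, 910), (1200, 1350)]
predicted = [(100, 250), (400, 520), (805, 910), (1200, 1350), (1500, 1560)]

sn, sp = exon_level_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this toy case three of the four annotated exons are recovered exactly, and three of the five predictions are correct. Under this convention a near miss such as the shifted start at 805 counts against both measures.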
