Ying Xu, G. Helt, J. Einstein, G. Rubin, E. Uberbacher
{"title":"Drosophila GRAIL: an intelligent system for gene recognition in Drosophila DNA sequences","authors":"Ying Xu, G. Helt, J. Einstein, G. Rubin, E. Uberbacher","doi":"10.1109/INBS.1995.404269","DOIUrl":null,"url":null,"abstract":"An AI-based system for gene recognition in Drosophila DNA sequences was designed and implemented. The system consists of two main modules, one for coding exon recognition and one for single gene model construction. The exon recognition module finds a coding exon by recognition of its splice junctions (or translation start) and coding potential. The core of this module is a set of neural networks which evaluate an exon candidate for the possibility of being a true coding exon using the \"recognized\" splice junction (or translation start) and coding signals. The recognition process consists of four steps: generation of an exon candidate pool, elimination of improbable candidates using heuristic rules, candidate evaluation by trained neural networks, and candidate cluster resolution and final exon prediction. The gene model construction module takes as input the clustered exon candidates and builds a \"best\" possible single gene model using an efficient dynamic programming algorithm. 129 Drosophila sequences consisting of 441 coding exons including 216358 coding bases were extracted from GenBank and used to build statistical matrices and to train the neural networks. On this training set the system recognized 97% of the coding messages and predicted only 5% false messages. Among the \"correctly\" predicted exons, 68% match the actual exon exactly and 96% have at least one edge predicted correctly. On an independent test set consisting of 30 Drosophila sequences, the system recognized 96% of the coding messages and predicted 7% false messages.<<ETX>>","PeriodicalId":423954,"journal":{"name":"Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS'95","volume":"95 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1995-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS'95","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INBS.1995.404269","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
An AI-based system for gene recognition in Drosophila DNA sequences was designed and implemented. The system consists of two main modules, one for coding exon recognition and one for single gene model construction. The exon recognition module finds a coding exon by recognition of its splice junctions (or translation start) and coding potential. The core of this module is a set of neural networks which evaluate an exon candidate for the possibility of being a true coding exon using the "recognized" splice junction (or translation start) and coding signals. The recognition process consists of four steps: generation of an exon candidate pool, elimination of improbable candidates using heuristic rules, candidate evaluation by trained neural networks, and candidate cluster resolution and final exon prediction. The gene model construction module takes as input the clustered exon candidates and builds a "best" possible single gene model using an efficient dynamic programming algorithm. 129 Drosophila sequences consisting of 441 coding exons including 216358 coding bases were extracted from GenBank and used to build statistical matrices and to train the neural networks. On this training set the system recognized 97% of the coding messages and predicted only 5% false messages. Among the "correctly" predicted exons, 68% match the actual exon exactly and 96% have at least one edge predicted correctly. On an independent test set consisting of 30 Drosophila sequences, the system recognized 96% of the coding messages and predicted 7% false messages.<>