{"title":"Protein classification using a neural network database system","authors":"Cathy H. Wu, T. Chang","doi":"10.1145/106965.105260","DOIUrl":null,"url":null,"abstract":"Neural networks are being applied to a widening range of applications, including the biological applications of protein structure prediction and DNA sequence analysis. This paper describes a novel application of neural networks to the classification of the immense amounts of sequencing data being generated by the Human Genome Project and genetic engineering research. Protein classification is an alternative approach to the large database search problem, in which the search time is not constrained by the database size. Previously, we implemented a prototype protein classification system, ProCANS, and demonstrated rapid and accurate allocation of 30 protein classes. This research scales up the pilot system into a “neural database” system and aims at the classification of unknown protein sequences into 2,350 protein superfamilies (classes) currently being identified in the PIR (Protein Identification Resource) protein sequence database. The neural network protein database (NNPDB) system involves two major design principles: (a) a sequence encoding schema to effectively retrieve salient information from sequence strings, and (b) a modular network architecture to store the huge amount of training patterns. The complete NNPDB program, which includes a preprocessor for sequence encoding, a neural network for classification, and a postprocessor for report generation, has been implemented on a CONVEX/CRAY computer platform. The NNPDB system is developed incrementally by training and optimizing each network module. After 200 to 13,000 CRAY CPU seconds of training for the network modules, the system is able to predict within three CPU seconds with 90 to 99% accuracy for the two protein groups tested, the electron transfer proteins and the oxidoreductases. In addition to the accuracy and speed of classification, the system architecture permits the identification of salient sequence information and flexible database growth and update. The neural database, which consists of a set of weight matrices of the networks, can be ported to other computers for speedy on-line analysis of new sequences, and directly benefits the biology community. Furthermore, the system design should be easily adaptable for the information processing of other large and complex domains.","PeriodicalId":359315,"journal":{"name":"conference on Analysis of Neural Network Applications","volume":"407 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1991-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"conference on Analysis of Neural Network Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/106965.105260","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 10
Abstract
Neural networks are being applied to a widening range of applications, including the biological applications of protein structure prediction and DNA sequence analysis. This paper describes a novel application of neural networks to the classification of the immense amounts of sequencing data being generated by the Human Genome Project and genetic engineering research. Protein classification is an alternative approach to the large database search problem, in which the search time is not constrained by the database size. Previously, we implemented a prototype protein classification system, ProCANS, and demonstrated rapid and accurate allocation of 30 protein classes. This research scales up the pilot system into a “neural database” system and aims at the classification of unknown protein sequences into 2,350 protein superfamilies (classes) currently being identified in the PIR (Protein Identification Resource) protein sequence database. The neural network protein database (NNPDB) system involves two major design principles: (a) a sequence encoding schema to effectively retrieve salient information from sequence strings, and (b) a modular network architecture to store the huge amount of training patterns. The complete NNPDB program, which includes a preprocessor for sequence encoding, a neural network for classification, and a postprocessor for report generation, has been implemented on a CONVEX/CRAY computer platform. The NNPDB system is developed incrementally by training and optimizing each network module. After 200 to 13,000 CRAY CPU seconds of training for the network modules, the system is able to predict within three CPU seconds with 90 to 99% accuracy for the two protein groups tested, the electron transfer proteins and the oxidoreductases. In addition to the accuracy and speed of classification, the system architecture permits the identification of salient sequence information and flexible database growth and update. The neural database, which consists of a set of weight matrices of the networks, can be ported to other computers for speedy on-line analysis of new sequences, and directly benefits the biology community. Furthermore, the system design should be easily adaptable for the information processing of other large and complex domains.
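
The two design principles named in the abstract, a sequence encoding schema and a modular network architecture whose weight matrices form the "neural database", can be illustrated with a small sketch. The abstract does not specify the encoding or the network details, so everything below is an assumption made for illustration: n-gram frequency counting stands in for the encoding schema, a single linear layer with hand-set weights stands in for each trained network module, and the names `ngram_encode`, `module_scores`, and `classify` are hypothetical rather than part of NNPDB.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_encode(sequence, n=2):
    """Map a variable-length sequence to a fixed-length n-gram frequency vector.

    Assumption: n-gram counting is one plausible encoding schema; the
    abstract does not say which schema NNPDB uses.
    """
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    index = {gram: i for i, gram in enumerate(vocab)}
    vector = [0.0] * len(vocab)
    total = 0
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if gram in index:  # skip n-grams containing non-standard residues
            vector[index[gram]] += 1.0
            total += 1
    if total:
        vector = [v / total for v in vector]  # normalize counts to frequencies
    return vector


def module_scores(weights, features):
    """Score every class of one module; each weight row belongs to one class.

    Assumption: a single linear layer stands in for a trained network.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def classify(modules, features):
    """Query every module and return (module name, class index, score) of the best hit."""
    best = None
    for name, weights in modules.items():
        scores = module_scores(weights, features)
        idx = max(range(len(scores)), key=scores.__getitem__)
        if best is None or scores[idx] > best[2]:
            best = (name, idx, scores[idx])
    return best


# Toy "neural database": one weight matrix per module (hypothetical values).
modules = {
    "electron_transfer": [[1.0] + [0.0] * 19],
    "oxidoreductase": [[0.0, 1.0] + [0.0] * 18],
}
features = ngram_encode("AAC", n=1)
print(classify(modules, features))
```

Because each module is only a weight matrix in this sketch, adding a protein group amounts to adding one dictionary entry, which mirrors the flexible database growth and portability that the abstract attributes to the modular design.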