Protein classification using a neural network database system

Cathy H. Wu, T. Chang
Conference on Analysis of Neural Network Applications, 1991-05-29
DOI: 10.1145/106965.105260 (https://doi.org/10.1145/106965.105260)
Citations: 10

Abstract

Neural networks are being applied to a rapidly expanding range of applications, including the biological applications of protein structure prediction and DNA sequence analysis. This paper describes a novel application of neural networks to the classification of the immense amounts of sequencing data being generated by the Human Genome Project and genetic engineering research. Protein classification is an alternative approach to the large database search problem, in which the search time is not constrained by the database size. Previously, we implemented a prototype protein classification system, ProCANS, and demonstrated rapid and accurate allocation of sequences to 30 protein classes. This research scales up the pilot system into a "neural database" system and aims at the classification of unknown protein sequences into the 2,350 protein superfamilies (classes) currently identified in the PIR (Protein Identification Resources) protein sequence database. The neural network protein database (NNPDB) system involves two major design principles: (a) a sequence encoding schema to effectively retrieve salient information from sequence strings, and (b) a modular network architecture to store the huge amount of training patterns. The complete NNPDB program, which includes a preprocessor for sequence encoding, a neural network for classification, and a postprocessor for report generation, has been implemented on a CONVEX/CRAY computer platform. The NNPDB system is developed incrementally by training and optimizing each network module. After 200 to 13,000 CRAY CPU seconds of training for the network modules, the system is able to predict within three CPU seconds with 90 to 99% accuracy for the two protein groups tested, the electron transfer proteins and the oxidoreductases. In addition to the accuracy and speed of classification, the system architecture permits the identification of salient sequence information and flexible database growth and update. The neural database, which consists of a set of weight matrices of the networks, can be ported to other computers for speedy on-line analysis of new sequences, and directly benefits the biology community. Furthermore, the system design should be easily adaptable to the information processing of other large and complex domains.
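The two design principles named in the abstract can be illustrated with a small sketch. The abstract does not say which encoding or which module-combination rule NNPDB uses, so everything below is an assumption made for illustration: the encoder counts overlapping n-grams over the 20 standard amino acids, and the hypothetical `classify` helper simply takes the highest score reported by any module. The names `ngram_encode`, `classify`, and `AMINO_ACIDS` are not from the paper.

```python
from itertools import product

# Assumption: the 20 standard amino-acid letters form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_encode(sequence, n=2):
    """Map a sequence of any length to a fixed-length vector of n-gram frequencies."""
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}
    counts = [0.0] * len(index)
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if gram in index:  # n-grams with non-standard letters are skipped
            counts[index[gram]] += 1.0
    total = sum(counts)
    # Normalize so that sequences of different lengths are comparable.
    return [c / total for c in counts] if total else counts


def classify(vector, modules):
    """Query every module and return the (label, score) pair with the highest score.

    Each module is any callable that maps an input vector to a dict of
    {class_label: score}; in a real system it would wrap one trained network.
    """
    best = None
    for module in modules:
        for label, score in module(vector).items():
            if best is None or score > best[1]:
                best = (label, score)
    return best
```

Because the input vector has a fixed length regardless of sequence length, and each module covers only its own subset of classes, classification cost in this sketch depends on the number of modules rather than on the number of database entries, which is the property the abstract emphasizes.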