Peptide programs: applying fragment programs to protein classification

Data and Text Mining in Bioinformatics Pub Date : 2008-10-30 DOI:10.1145/1458449.1458459

A. O. Falcão, Daniel Faria, António E. N. Ferreira

Citation count: 3

Abstract

Functional prediction and classification of proteins is a central problem in bioinformatics. Alignment methods are useful but have limitations, which has prompted the development and use of machine learning approaches. However, traditional machine learning approaches cannot exploit sequence data directly; instead, they use derived sequence features or kernel functions to obtain a feature space. Because, in theory, all the information needed to predict a protein's structure and function is contained in its sequence, a methodology that exploits sequence data directly could be advantageous. A novel machine learning methodology for protein classification, inspired by the concept of fragment programs, is presented. The methodology consists of assigning a minimal computer program to each of the 20 amino acids and then representing a protein as the program that results from sequentially applying the programs of the amino acids that compose its sequence. The basic concepts of the methodology (peptide programs) are discussed, and a framework is proposed for their implementation, including the instruction set, virtual machine, evaluation procedures and convergence methods. The methodology is tested on the binary classification of 33,500 enzymes into 182 distinct Enzyme Commission (EC) classes. The average Matthews correlation coefficient of the binary classifiers is 0.75 in training and 0.68 in validation. Overall, the results demonstrate the potential of the proposed methodology and its ability to extract knowledge from sequence data using very few computational resources.
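
To make the idea in the abstract concrete, the following is a minimal sketch of how one program per amino acid could be applied sequentially to a shared machine state, together with the Matthews correlation coefficient used for evaluation. The abstract does not specify the instruction set, the virtual machine, or the convergence method, so everything here is an assumption made for illustration: the two-register machine, the opcodes `add`, `acc` and `swap`, the function names, the decision threshold, and the hand-written `toy_programs` (in the paper the per-residue programs are obtained by a convergence procedure, not written by hand).

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def run_peptide_program(sequence, programs):
    """Execute each residue's program, in sequence order, on a shared state."""
    regs = [0, 0]  # hypothetical two-register virtual machine (an assumption)
    for residue in sequence:
        for op, arg in programs[residue]:
            if op == "add":      # add a constant to register 0
                regs[0] += arg
            elif op == "acc":    # accumulate register 0 into register 1
                regs[1] += regs[0]
            elif op == "swap":   # exchange the two registers
                regs[0], regs[1] = regs[1], regs[0]
            else:
                raise ValueError(f"unknown opcode: {op}")
    return regs


def classify(sequence, programs, threshold=0):
    """Binary decision read from register 0 after the whole protein has run."""
    return run_peptide_program(sequence, programs)[0] > threshold


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy, hand-written programs (hypothetical): every residue is a no-op
# except K, which adds 1, and D, which subtracts 1.
toy_programs = {aa: [("add", 0)] for aa in AMINO_ACIDS}
toy_programs["K"] = [("add", 1)]
toy_programs["D"] = [("add", -1)]

if __name__ == "__main__":
    sequences = ["KKD", "DDK", "AKA", "ADA"]
    labels = [True, False, True, False]
    predictions = [classify(s, toy_programs) for s in sequences]
    print(predictions, matthews_corrcoef(labels, predictions))
```

In this sketch one binary classifier corresponds to one set of 20 programs, which mirrors the abstract's setup of separate binary classifiers per EC class; a training loop that searches over program sets to maximise the coefficient is deliberately left out, since the abstract gives no details of it.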
