A top-down approach to classify enzyme functional classes and sub-classes using random forest.

EURASIP Journal on Bioinformatics & Systems Biology. Pub Date: 2012-02-29. DOI: 10.1186/1687-4153-2012-1

Chetan Kumar, Alok Choudhary


Citations: 47

Abstract

Advances in sequencing technologies have led to an exponential rise in the number of newly discovered enzymes. Enzymes are proteins that catalyze biochemical reactions and play an important role in metabolic pathways. The function of such enzymes is commonly determined by experiments, which can be time-consuming and costly. Hence, there is a need for a computational method that can distinguish enzyme sequences from non-enzyme sequences and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on sequence and structural similarity have been proposed. However, these approaches are known to fail for proteins that perform the same function but are dissimilar in sequence and structure. In this article, we present a supervised machine learning model that predicts the functional class and sub-class of enzymes from a set of 73 sequence-derived features. The functional classes are those defined by the International Union of Biochemistry and Molecular Biology. Using random forest, an efficient data mining algorithm, we construct a top-down, three-layer model in which the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main functional class, and the bottom layer predicts the sub-class. The model achieved an overall classification accuracy of 94.87% at the first level, 87.7% at the second, and 84.25% at the bottom level. These results compare well with existing methods and in many cases show better performance. Using feature selection methods, we also show the biological relevance of some of the top-ranked attributes.
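The top-down routing described in the abstract can be sketched as follows. This is a minimal illustration of the three-layer cascade only, not the authors' implementation: the names `TopDownModel`, `Prediction`, and `build_toy_model` are hypothetical, the threshold rules are invented stand-ins for trained random forests, and the use of one sub-class classifier per main class is an assumption that the abstract does not confirm. Feature vectors here have two toy values rather than the 73 sequence-derived features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

Features = Sequence[float]
# Stand-in for a trained model's predict function (e.g. a random forest).
Classifier = Callable[[Features], str]


@dataclass
class Prediction:
    is_enzyme: bool
    main_class: Optional[str] = None  # e.g. "EC 1"; None for non-enzymes
    sub_class: Optional[str] = None   # e.g. "EC 1.1"; None if not predicted


@dataclass
class TopDownModel:
    enzyme_filter: Classifier               # layer 1: "enzyme" / "non-enzyme"
    main_classifier: Classifier             # layer 2: main functional class
    sub_classifiers: Dict[str, Classifier]  # layer 3: assumed one per main class

    def predict(self, x: Features) -> Prediction:
        # Layer 1: stop early if the query is not predicted to be an enzyme.
        if self.enzyme_filter(x) != "enzyme":
            return Prediction(is_enzyme=False)
        # Layer 2: predict the main functional class.
        main = self.main_classifier(x)
        # Layer 3: route to the sub-class classifier for that main class.
        sub_model = self.sub_classifiers.get(main)
        sub = sub_model(x) if sub_model is not None else None
        return Prediction(True, main, sub)


def build_toy_model() -> TopDownModel:
    # Hypothetical threshold rules standing in for trained classifiers.
    return TopDownModel(
        enzyme_filter=lambda x: "enzyme" if x[0] > 0.5 else "non-enzyme",
        main_classifier=lambda x: "EC 1" if x[1] < 0.5 else "EC 3",
        sub_classifiers={
            "EC 1": lambda x: "EC 1.1",
            "EC 3": lambda x: "EC 3.4",
        },
    )


if __name__ == "__main__":
    model = build_toy_model()
    for query in ([0.1, 0.0], [0.9, 0.2], [0.9, 0.8]):
        print(query, model.predict(query))
```

One consequence of this design is that a mistake at an upper layer carries through to the layers below: a query misrouted at layer 2 can never receive the correct sub-class at layer 3.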



