Predicting "Essential" Genes across Microbial Genomes: A Machine Learning Approach

2011 10th International Conference on Machine Learning and Applications and Workshops. Pub Date: 2011-12-18. DOI: 10.1109/ICMLA.2011.114

Krishna Palaniappan, Sumitra Mukherjee

Citations: 14

Abstract

Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologists and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence-derived input features. We investigate three supervised classification methods, Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT), for this binary classification task. The classifiers are trained and evaluated using 37,830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant features derived from the genomic sequence is used as input for the classifiers. The models were evaluated using two novel blind testing schemes, Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO), and a 10-fold stratified cross-validation (10-f-cv) strategy, on both the full multi-genome dataset and its class-imbalance-reduced variants. Experimental results (10 × 10-f-cv) indicate that SVM and ANN perform better than DT, with Area Under the Receiver Operating Characteristic curve (AU-ROC) scores of 0.80, 0.79, and 0.68, respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes using only the gene sequence and features derived from it. LOGO and LOTO blind test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.
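
The Leave-One-Genome-Out scheme described above can be illustrated with a minimal sketch: each genome is held out in turn, a model is fit on the remaining genomes, and AU-ROC is computed on the held-out genome. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`leave_one_genome_out`, `auc_roc`, `run_logo`), the single toy feature, the toy labels, and the nearest-centroid scorer, which stands in for the SVM, ANN, and DT classifiers the authors actually used.

```python
from collections import defaultdict


def leave_one_genome_out(genome_ids):
    """Yield (held_out_genome, train_idx, test_idx) once per distinct genome."""
    by_genome = defaultdict(list)
    for i, g in enumerate(genome_ids):
        by_genome[g].append(i)
    for g in sorted(by_genome):
        test_idx = by_genome[g]
        train_idx = [i for i, other in enumerate(genome_ids) if other != g]
        yield g, train_idx, test_idx


def auc_roc(labels, scores):
    """AU-ROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AU-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroids(xs, ys):
    """Toy stand-in model: mean feature value of each class.
    Assumes both classes occur in the training data."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def score(x, centroids):
    """Higher when x lies closer to the 'essential' centroid."""
    pos_mean, neg_mean = centroids
    return abs(x - neg_mean) - abs(x - pos_mean)


def run_logo(genome_ids, features, labels):
    """Return {held_out_genome: AU-ROC on that genome}."""
    results = {}
    for genome, train_idx, test_idx in leave_one_genome_out(genome_ids):
        centroids = fit_centroids(
            [features[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        test_scores = [score(features[i], centroids) for i in test_idx]
        results[genome] = auc_roc([labels[i] for i in test_idx], test_scores)
    return results


# Hypothetical toy data: three genomes, one sequence-derived feature per gene,
# label 1 = essential, 0 = non-essential.
GENOMES = ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"]
FEATURES = [0.9, 0.2, 0.1, 0.8, 0.3, 0.2, 0.7, 0.6, 0.1]
LABELS = [1, 0, 0, 1, 0, 0, 1, 0, 0]

if __name__ == "__main__":
    for genome, auc in run_logo(GENOMES, FEATURES, LABELS).items():
        print(f"held out {genome}: AU-ROC = {auc:.2f}")
```

Grouping the split by genome, rather than shuffling genes freely, keeps every gene of the test organism out of training, which is what makes the test "blind". A Leave-One-Taxon-group-Out split would follow the same pattern with taxon-group labels passed in place of genome identifiers.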
