{"title":"Predicting \"Essential\" Genes across Microbial Genomes: A Machine Learning Approach","authors":"Krishna Palaniappan, Sumitra Mukherjee","doi":"10.1109/ICMLA.2011.114","DOIUrl":null,"url":null,"abstract":"Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologist and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence derived input features. We investigate three different supervised classification methods -- Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT) -- for this binary classification task. The classifiers are trained and evaluated using 37830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant genomic sequence derived features is used as input for the classifiers. The models were evaluated using novel blind testing schemes Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO) and 10-fold stratified cross validation (10-f-cv) strategy on both the full multi-genome datasets and its class imbalance reduced variants. Experimental results (10 X 10-f-cv) indicate SVM and ANN perform better than DT with Area under the Receiver Operating Characteristics (AU-ROC) scores of 0.80, 0.79 and 0.68 respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes by using only gene sequence and features derived from it. LOGO and LOTO Blind test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.","PeriodicalId":439926,"journal":{"name":"2011 10th International Conference on Machine Learning and Applications and Workshops","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 10th International Conference on Machine Learning and Applications and Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2011.114","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologist and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence derived input features. We investigate three different supervised classification methods -- Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT) -- for this binary classification task. The classifiers are trained and evaluated using 37830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant genomic sequence derived features is used as input for the classifiers. The models were evaluated using novel blind testing schemes Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO) and 10-fold stratified cross validation (10-f-cv) strategy on both the full multi-genome datasets and its class imbalance reduced variants. Experimental results (10 X 10-f-cv) indicate SVM and ANN perform better than DT with Area under the Receiver Operating Characteristics (AU-ROC) scores of 0.80, 0.79 and 0.68 respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes by using only gene sequence and features derived from it. LOGO and LOTO Blind test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.