A Random Forest Classifier for Prokaryotes Gene Prediction
Authors: Raíssa Silva, K. Souza, F. Góes, Ronnie Alves
Venue: Brazilian Conference on Intelligent Systems, published 2019-10-01
DOI: 10.1109/BRACIS.2019.00101 (https://doi.org/10.1109/BRACIS.2019.00101)
Metagenomics is the study of microbial genomes, known as metagenomes. It describes them through the composition, relationships, and activities of their microorganisms, thus allowing greater knowledge of the fundamentals of life and of the broad microbial diversity. One way to accomplish this task is to analyze information from the genes contained in metagenomes. The process of identifying genes in DNA sequences is usually called gene prediction. This work presents a new gene predictor based on the Random Forest classifier. The proposed model obtains better classification results than state-of-the-art gene prediction tools widely used by the bioinformatics community. Random Forest produced more robust results, being 27% better than Prodigal and 20% better than FragGeneScan with respect to AUC values on the independent test set. Feature engineering has been revisited in the gene prediction problem, reinforcing the importance of carefully evaluating and assembling a good feature set. K-mer counting features can be seen as the fundamental building blocks for developing robust gene predictors.
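To make the idea of k-mer counting features concrete, the following is a minimal sketch of how a DNA sequence might be turned into a fixed-length count vector. It is not the authors' feature pipeline: the abstract does not state which values of k were used or how the features were laid out, so the function name `kmer_counts`, the default `k=3`, and the lexicographic ACGT ordering are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 3) -> list[int]:
    """Count every DNA k-mer in `sequence` (hypothetical helper, not from the paper).

    The result has 4**k entries, ordered lexicographically over the alphabet ACGT.
    """
    sequence = sequence.upper()
    # Fixed vocabulary of all 4**k k-mers, so every sequence maps to the same layout.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k over the sequence, one base at a time.
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows with ambiguous bases (e.g. N) are not in the vocabulary and are ignored.
    return [observed[kmer] for kmer in vocabulary]


features = kmer_counts("ATGGCGTAA", k=2)
print(len(features))  # 16 features for k=2
```

Vectors of this kind, computed for candidate coding and non-coding regions, could then be passed to a tree-ensemble classifier such as scikit-learn's `RandomForestClassifier`; which classifier settings the paper used is not stated in the abstract.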