Ensemble Feature Selection from Cancer Gene Expression Data using Mutual Information and Recursive Feature Elimination
Nimrita Koul, S. Manvi
2020 Third International Conference on Advances in Electronics, Computers and Communications (ICAECC), 2020-12-11
DOI: https://doi.org/10.1109/ICAECC50550.2020.9339518
The availability of high-throughput gene expression data has enabled its computational analysis for the early diagnosis of diseases such as cancer. Such data contain the expression values of thousands of genes in the genome of an organism. The data are therefore very high dimensional, with one dimension per gene, yet very few of these genes are associated with a given disease. At the same time, the number of samples or observations available is very small compared with the number of features, and the data also suffer from class imbalance. Selecting the genes that are relevant to the disease being studied is therefore an important task that is widely researched in the computational sciences. In this paper, we propose a randomized ensemble method for feature selection from cancer gene expression data using a combination of mutual information and recursive feature elimination. The approach has been applied to a Leukemia gene expression dataset. We obtained a classification accuracy of 99% with a subset of 316 genes, and an accuracy of 95% with a subset of 4 genes. Thus we achieved a dimensionality reduction of 98.5% with 99% accuracy. Comparison with standard methods shows that the proposed method performs better.
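The abstract does not spell out how mutual information and recursive feature elimination are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes that the randomization comes from bootstrap resampling, that features are binarized at the median before mutual information is computed, that a small logistic regression supplies the weights used for elimination, and that the final subset is chosen by voting across rounds. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def binarize(column):
    """Discretize a continuous feature at its median (assumption of this sketch)."""
    median = sorted(column)[len(column) // 2]
    return [int(v > median) for v in column]


def fit_logistic(X, y, features, epochs=200, lr=0.1):
    """Logistic regression by batch gradient descent; returns {feature: weight}."""
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: 0.0 for j in features}
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(w[j] * row[j] for j in features)
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for j in features:
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        for j in features:
            w[j] -= lr * grad_w[j] / n
    return w


def rfe(X, y, features, n_keep):
    """Refit the model and drop the smallest-|weight| feature until n_keep remain."""
    features = list(features)
    while len(features) > n_keep:
        w = fit_logistic(X, y, features)
        features.remove(min(features, key=lambda j: abs(w[j])))
    return features


def ensemble_select(X, y, n_rounds=10, mi_keep=6, n_keep=2, seed=0):
    """Vote over bootstrap rounds of (MI filter -> RFE); return the top n_keep."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    votes = Counter()
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        mi = {j: mutual_information(binarize([r[j] for r in Xb]), yb)
              for j in range(d)}
        top = sorted(mi, key=mi.get, reverse=True)[:mi_keep]
        votes.update(rfe(Xb, yb, top, n_keep))
    return [j for j, _ in votes.most_common(n_keep)]


def make_toy_data(n_samples=40, n_features=20, seed=1):
    """Synthetic data (not gene expression): only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] = 2.0 * label + rng.gauss(0.0, 0.3)
        row[1] = -2.0 * label + rng.gauss(0.0, 0.3)
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(ensemble_select(X, y))
```

The mutual information filter is cheap and cuts the candidate set before the costlier elimination loop, which refits a model at every step. Comparing absolute weights in `rfe` is only meaningful when features are on similar scales, so real expression data would normally be standardized first.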