Ensemble Feature Selection from Cancer Gene Expression Data using Mutual Information and Recursive Feature Elimination

2020 Third International Conference on Advances in Electronics, Computers and Communications (ICAECC) Pub Date : 2020-12-11 DOI:10.1109/ICAECC50550.2020.9339518

Nimrita Koul, S. Manvi


Citations: 1


Ensemble Feature Selection from Cancer Gene Expression Data using Mutual Information and Recursive Feature Elimination

The availability of high-throughput gene expression data has enabled its computational analysis for the early diagnosis of diseases such as cancer. These data contain expression values for thousands of genes in an organism's genome. However, gene expression data are very high dimensional, with one dimension per gene, and very few of these genes are associated with a given disease. At the same time, the number of samples or observations available is very small compared to the number of features, and the data also suffer from class imbalance. Selecting the genes that are relevant to the disease being studied is therefore an important task that is widely researched in the computational sciences. In this paper, we propose a randomized ensemble method for feature selection from cancer gene expression data using a combination of mutual information and recursive feature elimination. The approach was applied to a Leukemia gene expression dataset. We obtained a classification accuracy of 99% with a subset of 316 genes, and an accuracy of 95% with a subset of 4 genes. Thus, we achieved a dimensionality reduction of 98.5% with 99% accuracy. Comparison with standard methods shows that the proposed method performs better.
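The abstract names the ingredients (randomization, an ensemble, mutual information, recursive feature elimination) but not how they are wired together. The following is a minimal sketch of one plausible arrangement, not the authors' implementation. Everything specific in it is an assumption made for illustration: bootstrap resampling as the source of randomness, equal-width binning for the mutual information estimate, a small gradient-descent logistic regression as the base model for elimination, majority voting across rounds, and all function names and default parameters. The toy data are synthetic and are not the Leukemia dataset.

```python
import math
import random
from collections import Counter


def mutual_information(values, labels, bins=3):
    """Mutual information (nats) between an equal-width-binned feature and labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature: everything lands in bin 0
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(labels)
    joint, px, py = Counter(zip(binned, labels)), Counter(binned), Counter(labels)
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def fit_logistic(X, y, epochs=100, lr=0.1):
    """Batch gradient-descent logistic regression; returns one weight per column."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - target
            gb += err
            for j, xj in enumerate(row):
                gw[j] += err * xj
        b -= lr * gb / n
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
    return w


def rfe(X, y, candidates, n_keep):
    """Recursive feature elimination: refit, then drop the smallest-|weight| feature."""
    kept = list(candidates)
    while len(kept) > n_keep:
        w = fit_logistic([[row[j] for j in kept] for row in X], y)
        kept.pop(min(range(len(kept)), key=lambda i: abs(w[i])))
    return kept


def ensemble_select(X, y, n_rounds=10, mi_top=6, n_keep=2, seed=0):
    """Vote over bootstrap rounds of (mutual information filter -> RFE)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = Counter()
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample (assumed)
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        mi = [mutual_information([r[j] for r in Xb], yb) for j in range(p)]
        top = sorted(range(p), key=lambda j: mi[j], reverse=True)[:mi_top]
        votes.update(rfe(Xb, yb, top, n_keep))
    return [j for j, _ in votes.most_common(n_keep)]


def make_toy_data(n=60, p=20, seed=1):
    """Synthetic two-class data: columns 0 and 1 carry signal, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[3 * label + rng.gauss(0, 0.5) if j < 2 else rng.gauss(0, 1)
          for j in range(p)] for label in y]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(sorted(ensemble_select(X, y)))
```

The filter step cheaply discards most columns before the more expensive refit-and-eliminate loop runs, and repeating both steps on resampled data is one way to stabilize the selection when samples are far fewer than features, as the abstract describes for gene expression data.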
