The Impact of Under-sampling on the Performance of Bootstrap-based Ensemble Feature Selection

Signal Processing and Communications Applications Conference. Pub Date: 2018-05-01. DOI: 10.1109/SIU.2018.8404342

Huseyin Guney, H. Oztoprak


Citations: 1

Abstract

DNA microarrays are a promising tool for cancer diagnosis and prognosis. Because microarray data are high-dimensional, gene selection is a difficult task. Bootstrap-based ensemble feature selection (bagging) has recently become popular and has shown significant improvements in the field. The method generates several slightly different sampled datasets from the training dataset using bootstrap resampling, then aggregates the ranked feature lists produced from these sampled datasets to obtain the final (ensemble) feature list. The performance of bagging is proportional to the diversity of the generated sampled datasets. Therefore, this study proposes bootstrap resampling from an under-sampled training set, rather than from the entire training set, to improve classification performance and gene selection stability. The proposed method was evaluated on four microarray datasets, with a support vector machine (SVM) as the classifier and SVM-based recursive feature elimination (SVM-RFE) as the feature selection technique. The results show that the 50% under-sampling approach achieves classification performance similar to that of the conventional approach and outperforms it in gene selection stability. In addition, 50% under-sampling uses only half of the training samples in each run of the ensemble method, so it has a lower computational cost.
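The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the under-sampling is assumed to be drawn afresh without replacement for every bag, the ranked lists are aggregated by summed rank position (the abstract does not name the aggregation rule), and a simple class-mean-difference ranker stands in for SVM-RFE so the example needs only the standard library. The names `rank_features` and `ensemble_ranking`, and the toy data, are hypothetical.

```python
import random
from statistics import mean


def rank_features(X, y):
    """Stand-in ranker (NOT SVM-RFE): order features by the absolute
    difference between the two class means, best feature first."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        # A resampled set can miss a class entirely; score 0 in that case.
        scores.append(abs(mean(pos) - mean(neg)) if pos and neg else 0.0)
    return sorted(range(n_features), key=lambda j: -scores[j])


def ensemble_ranking(X, y, n_bags=20, subsample=0.5, ranker=rank_features, seed=0):
    """Bagged feature ranking with under-sampling (assumed interpretation).

    subsample=1.0 corresponds to the conventional approach, where the
    bootstrap is drawn from the entire training set.
    """
    rng = random.Random(seed)
    n = len(X)
    m = max(2, round(subsample * n))
    n_features = len(X[0])
    rank_sums = [0] * n_features
    for _ in range(n_bags):
        # Step 1: under-sample the training set without replacement.
        subset = rng.sample(range(n), m)
        # Step 2: bootstrap (with replacement) from the under-sampled set.
        boot = [rng.choice(subset) for _ in range(m)]
        order = ranker([X[i] for i in boot], [y[i] for i in boot])
        # Step 3: accumulate each feature's rank position across bags.
        for position, feature in enumerate(order):
            rank_sums[feature] += position
    # Lower summed rank means the feature was ranked higher more consistently.
    return sorted(range(n_features), key=lambda j: rank_sums[j])


# Toy data (hypothetical): feature 0 separates the classes, features 1-2 are noise.
demo_rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[10 * label + demo_rng.random(), demo_rng.random(), demo_rng.random()]
     for label in y]

ranking = ensemble_ranking(X, y, n_bags=20, subsample=0.5)
print(ranking)
```

Passing a different `ranker` callable (for example, a wrapper around an SVM-RFE implementation) would bring the sketch closer to the evaluated setup; with `subsample=0.5` each bag handles half as many samples, which is where the lower computational cost mentioned in the abstract comes from.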



