The Impact of Under-sampling on the Performance of Bootstrap-based Ensemble Feature Selection

Huseyin Guney, H. Oztoprak
Published in: Signal Processing and Communications Applications Conference, 2018-05-01
DOI: https://doi.org/10.1109/SIU.2018.8404342
Citations: 1

Abstract

DNA microarrays are a promising tool for cancer diagnosis and prognosis. Because microarray data are high-dimensional, gene selection is a difficult task. Bootstrap-based ensemble feature selection (bagging) has recently become popular and has shown significant improvements in the field. This method uses bootstrap resampling to generate several slightly different sampled datasets from the training dataset. It then aggregates the ranked feature lists generated from the sampled datasets to obtain the final (ensemble) feature list. The performance of bagging is proportional to the diversity of the generated sampled datasets. It is therefore proposed to apply bootstrap resampling to an under-sampled subset of the training set, rather than to the entire training set, in order to improve classification performance and gene selection stability. The proposed method was evaluated using a support vector machine (SVM) as the classifier and SVM-based recursive feature elimination (SVM-RFE) as the feature selection technique. Four microarray datasets were used in the evaluation. The results show that the 50% under-sampling approach achieves classification performance similar to that of the conventional approach and outperforms it in terms of gene selection stability. In addition, 50% under-sampling uses only half of the training samples in each run of the ensemble method, so it has a lower computational cost.
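The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions that the abstract does not confirm: under-sampling is taken to mean drawing a random fraction of the training samples without replacement before each bootstrap draw, rank lists are aggregated by summed rank position, and a simple class-mean-difference score stands in for SVM-RFE so that the example runs with the standard library alone. The names `rank_features`, `ensemble_rank`, and `make_toy_data` are hypothetical.

```python
import random
from statistics import mean


def rank_features(X, y):
    """Stand-in ranker (NOT SVM-RFE): score = |mean(class 1) - mean(class 0)|."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    # Feature indices ordered from highest to lowest score.
    return sorted(range(n_features), key=lambda j: -scores[j])


def ensemble_rank(X, y, n_runs=20, fraction=0.5, seed=0):
    """Under-sample, bootstrap, rank, and aggregate (assumed reading of the abstract)."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    subset_size = max(2, int(fraction * n_samples))
    rank_sums = [0] * n_features
    done = 0
    while done < n_runs:
        # Under-sample: keep a random fraction of the training set (no replacement).
        subset = rng.sample(range(n_samples), subset_size)
        # Bootstrap: resample that subset with replacement.
        boot = [rng.choice(subset) for _ in range(subset_size)]
        labels = [y[i] for i in boot]
        if len(set(labels)) < 2:
            continue  # the stand-in ranker needs both classes, so redraw
        ranking = rank_features([X[i] for i in boot], labels)
        for position, feature in enumerate(ranking):
            rank_sums[feature] += position
        done += 1
    # Aggregate: a lower summed position means the feature ranked highly more often.
    return sorted(range(n_features), key=lambda j: rank_sums[j])


def make_toy_data(n_samples=40, n_features=10, seed=1):
    """Synthetic two-class data in which only feature 0 carries signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 5.0 * label
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(ensemble_rank(X, y, fraction=0.5))
```

Setting `fraction=1.0` gives conventional bagging over the full training set under the same assumptions, which makes it easy to compare the two settings on the same data. With `fraction=0.5`, each run ranks features on half as many samples, which is the source of the lower computational cost mentioned in the abstract.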