A Cost-Sensitive Feature Selection Method for High-Dimensional Data

Chaojie An, Qifeng Zhou
DOI: 10.1109/ICCSE.2019.8845414
Published in: 2019 14th International Conference on Computer Science & Education (ICCSE)
Publication date: 2019-08-01
Citations: 3

Abstract

With the increase of data dimensionality in many application fields, feature selection, as an essential step to avoid the curse of dimensionality and enhance the generalization of the model, is attracting more and more research attention. However, most existing feature selection methods assume that all features have the same cost. These research efforts mainly focus on features' relevance to learning performance while neglecting the cost of obtaining them. Feature cost is a crucial factor that needs to be considered in the feature selection problem, especially in real-world applications. For example, in medical diagnosis, each feature may have a very different testing cost. To select low-cost subsets of informative features, in this paper we propose a stratified random-forest-based cost-sensitive feature selection method. Unlike commonly used two-step cost-sensitive feature selection approaches, our model incorporates the cost of features into the construction process of the base decision trees; that is, the cost and the performance of each feature are optimized simultaneously. Moreover, we adopt a stratified sampling method to enhance the performance of the selected feature subset on high-dimensional data. A series of experimental results shows that, compared with state-of-the-art methods, the proposed approach can lower the cost of the selected feature subset while maintaining comparable learning performance.
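The abstract only sketches how cost enters the tree-building step. One common way to fold acquisition cost into split selection (as in EG2-style cost-sensitive trees, not necessarily the exact criterion of this paper) is to divide each feature's impurity gain by a power of its cost, so that cheap informative features are preferred over equally informative but expensive ones. The sketch below is illustrative only; the function names, the `(1 + cost)^lam` penalty, and the toy data are assumptions, not the authors' implementation.

```python
import numpy as np

def gini(y):
    # Gini impurity of an integer label vector
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def cost_sensitive_gain(x, y, cost, lam=1.0):
    # Best Gini gain over all thresholds of one feature,
    # penalized by that feature's acquisition cost (EG2-style).
    base = gini(y)
    best = 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        gain = base - w * gini(left) - (1 - w) * gini(right)
        best = max(best, gain)
    return best / (1.0 + cost) ** lam

def select_features(X, y, costs, k, lam=1.0):
    # Rank features by cost-adjusted gain and keep the top k.
    scores = [cost_sensitive_gain(X[:, j], y, costs[j], lam)
              for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Toy data: features 0 and 1 are equally informative,
# but feature 0 is 20x more expensive to acquire.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([
    y + rng.normal(0, 0.3, 200),   # informative, expensive
    y + rng.normal(0, 0.3, 200),   # informative, cheap
    rng.normal(0, 1, 200),         # pure noise
])
costs = np.array([10.0, 0.5, 0.5])
selected = select_features(X, y, costs, k=1)
print(selected)  # the cheap informative feature is chosen
```

In a random-forest setting this criterion would be applied at every node of every base tree, so cost and discriminative power are traded off jointly during construction rather than in a separate post-hoc filtering step, which is the distinction the abstract draws against two-step approaches.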