A survey of stability analysis of feature subset selection techniques

T. Khoshgoftaar, Alireza Fazelpour, Huanjing Wang, Randall Wald
{"title":"特征子集选择技术稳定性分析综述","authors":"T. Khoshgoftaar, Alireza Fazelpour, Huanjing Wang, Randall Wald","doi":"10.1109/IRI.2013.6642502","DOIUrl":null,"url":null,"abstract":"With the proliferation of high-dimensional datasets across many application domains in recent years, feature selection has become an important data mining task due to its capability to improve both performance and computational efficiencies. The chosen feature subset is important not only due to its ability to improve classification performance, but also because in some domains, knowing the most important features is an end unto itself. In this latter case, one important property of a feature selection method is stability, which refers to insensitivity (robustness) of the selected features to small changes in the training dataset. In this survey paper, we discuss the problem of stability, its importance, and various stability measures used to evaluate feature subsets. We place special focus on the problem of stability as it applies to subset evaluation approaches (whether they are selected through filter-based subset techniques or wrapper-based subset selection techniques) as opposed to feature ranker stability, as subset evaluation stability leads to challenges which have been the subject of less research. We also discuss one domain of particular importance where subset evaluation (and the stability thereof) shows particular importance, but which has previously had relatively little attention for subset-based feature selection: Big Data which originates from bioinformatics.","PeriodicalId":418492,"journal":{"name":"2013 IEEE 14th International Conference on Information Reuse & Integration (IRI)","volume":"229 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"31","resultStr":"{\"title\":\"A survey of stability analysis of feature subset selection techniques\",\"authors\":\"T. Khoshgoftaar, Alireza Fazelpour, Huanjing Wang, Randall Wald\",\"doi\":\"10.1109/IRI.2013.6642502\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the proliferation of high-dimensional datasets across many application domains in recent years, feature selection has become an important data mining task due to its capability to improve both performance and computational efficiencies. The chosen feature subset is important not only due to its ability to improve classification performance, but also because in some domains, knowing the most important features is an end unto itself. In this latter case, one important property of a feature selection method is stability, which refers to insensitivity (robustness) of the selected features to small changes in the training dataset. In this survey paper, we discuss the problem of stability, its importance, and various stability measures used to evaluate feature subsets. We place special focus on the problem of stability as it applies to subset evaluation approaches (whether they are selected through filter-based subset techniques or wrapper-based subset selection techniques) as opposed to feature ranker stability, as subset evaluation stability leads to challenges which have been the subject of less research. 
We also discuss one domain of particular importance where subset evaluation (and the stability thereof) shows particular importance, but which has previously had relatively little attention for subset-based feature selection: Big Data which originates from bioinformatics.\",\"PeriodicalId\":418492,\"journal\":{\"name\":\"2013 IEEE 14th International Conference on Information Reuse & Integration (IRI)\",\"volume\":\"229 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"31\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE 14th International Conference on Information Reuse & Integration (IRI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IRI.2013.6642502\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 14th International Conference on Information Reuse & Integration (IRI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IRI.2013.6642502","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 31

Abstract

With the proliferation of high-dimensional datasets across many application domains in recent years, feature selection has become an important data mining task due to its capability to improve both performance and computational efficiencies. The chosen feature subset is important not only due to its ability to improve classification performance, but also because in some domains, knowing the most important features is an end unto itself. In this latter case, one important property of a feature selection method is stability, which refers to insensitivity (robustness) of the selected features to small changes in the training dataset. In this survey paper, we discuss the problem of stability, its importance, and various stability measures used to evaluate feature subsets. We place special focus on the problem of stability as it applies to subset evaluation approaches (whether they are selected through filter-based subset techniques or wrapper-based subset selection techniques) as opposed to feature ranker stability, as subset evaluation stability leads to challenges which have been the subject of less research. We also discuss one domain of particular importance where subset evaluation (and the stability thereof) shows particular importance, but which has previously had relatively little attention for subset-based feature selection: Big Data which originates from bioinformatics.
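To make concrete the kind of subset-stability measure the abstract refers to, the sketch below estimates stability as the average pairwise Jaccard similarity between feature subsets selected on bootstrap resamples of a training set. This is a common measure from the stability literature, not necessarily one of the measures surveyed in the paper; the toy correlation-threshold filter, the synthetic data, and all names (`select_features`, `subset_stability`) are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two feature-index sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def subset_stability(subsets):
    """Average pairwise Jaccard similarity over all selected subsets;
    values near 1 indicate a stable selector, near 0 an unstable one."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def select_features(X, y, threshold=0.3):
    """Toy filter-based subset selector: keep features whose absolute
    Pearson correlation with the class label exceeds a threshold."""
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.flatnonzero(corr > threshold)

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Perturb the training data with bootstrap resampling and record
# which features the selector picks each time.
subsets = []
for _ in range(20):
    idx = rng.integers(0, n, size=n)
    subsets.append(select_features(X[idx], y[idx]))

print(f"Average pairwise Jaccard stability: {subset_stability(subsets):.3f}")
```

When every run selects a subset of the same fixed size, chance-corrected indices such as Kuncheva's consistency index are often preferred, since plain Jaccard similarity does not account for the agreement expected from random selection.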