The effect of measurement approach and noise level on gene selection stability

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops. Pub Date: 2012-10-04. DOI: 10.1109/BIBM.2012.6392713

Randall Wald, T. Khoshgoftaar, A. A. Shanab


Citations: 5

Abstract

Many biological datasets exhibit high dimensionality: a large number of attributes (genes) per instance (sample). This problem is often addressed using feature selection, which selects the most relevant attributes and removes irrelevant and redundant ones. Although feature selection techniques are often evaluated based on the performance of classification models (e.g., algorithms designed to distinguish between multiple classes of instances, such as cancerous vs. noncancerous) built using the selected features, another important and often neglected criterion is stability: the degree of agreement among a feature selection technique's outputs when the dataset changes. More stable feature selection techniques will give the same features even if aspects of the data change. In this study we consider two approaches for evaluating the stability of feature selection techniques, each consisting of noise injection followed by feature ranking. The first approach compares the features selected from the noisy datasets with the features selected from the original (clean) dataset, while the second performs pairwise comparisons among the results from the noisy datasets. To evaluate these two approaches, we use four biological datasets and six commonly used feature rankers. We draw two primary conclusions from our experiments. First, the rankers show different levels of stability in the face of noise; in particular, the ReliefF ranker has significantly greater stability than the other rankers. Second, both approaches gave the same results in terms of stability patterns, although the first approach had greater stability overall. Additionally, because the first approach is significantly less computationally expensive, future studies may employ this faster technique to obtain the same results.
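The two evaluation approaches described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise-injection scheme, the six rankers, or the similarity measure, so everything concrete here is an assumption made for illustration: Gaussian noise added to attribute values, a toy ranker based on the absolute difference of class means, Jaccard overlap of the top-k feature sets, and the hypothetical helper names `make_toy_data`, `inject_noise`, `top_k_features`, `jaccard`, `stability_vs_clean`, and `stability_pairwise`.

```python
import random
from itertools import combinations


def make_toy_data(n_samples, n_features, n_informative, rng):
    """Hypothetical synthetic data: the first n_informative features shift with the class."""
    y = [i % 2 for i in range(n_samples)]  # alternating labels, both classes present
    X = []
    for label in y:
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 2.0 * label
        X.append(row)
    return X, y


def inject_noise(X, sigma, rng):
    """Assumed noise model: additive Gaussian noise on every attribute value."""
    return [[v + rng.gauss(0, sigma) for v in row] for row in X]


def top_k_features(X, y, k):
    """Toy ranker (not one of the paper's): score = |mean(class 1) - mean(class 0)|."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def jaccard(a, b):
    """Assumed similarity measure: overlap of two selected-feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def stability_vs_clean(X, y, k, sigma, runs, rng):
    """First approach: compare each noisy selection with the clean selection."""
    clean = top_k_features(X, y, k)
    sims = [
        jaccard(clean, top_k_features(inject_noise(X, sigma, rng), y, k))
        for _ in range(runs)
    ]
    return sum(sims) / len(sims)


def stability_pairwise(X, y, k, sigma, runs, rng):
    """Second approach: compare every pair of noisy selections with each other."""
    selections = [
        top_k_features(inject_noise(X, sigma, rng), y, k) for _ in range(runs)
    ]
    sims = [jaccard(a, b) for a, b in combinations(selections, 2)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_toy_data(n_samples=40, n_features=100, n_informative=10, rng=rng)
    print("vs clean:", stability_vs_clean(X, y, k=10, sigma=1.0, runs=10, rng=rng))
    print("pairwise:", stability_pairwise(X, y, k=10, sigma=1.0, runs=10, rng=rng))
```

With `runs` noisy datasets, the first approach needs `runs` set comparisons while the second needs `runs * (runs - 1) / 2`, which is one way to see why comparing against the clean dataset is the cheaper of the two. The numbers this sketch prints say nothing about the paper's results.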
