Gene selection stability's dependence on dataset difficulty

D. Dittman, T. Khoshgoftaar, Randall Wald, Amri Napolitano
{"title":"Gene selection stability's dependence on dataset difficulty","authors":"D. Dittman, T. Khoshgoftaar, Randall Wald, Amri Napolitano","doi":"10.1109/IRI.2013.6642491","DOIUrl":null,"url":null,"abstract":"Identifying important biomarkers to improve disease diagnosis and treatment is a significant topic of research in bioinformatics. However, bioinformatics datasets frequently have a large number of features per sample or instance. This problem, known as “high dimensionality,” can be alleviated through the use of dimension reducing techniques such as feature (gene) selection which remove unnecessary features. There are many versions of feature selection, with varying biases and predictive abilities. However, predictive power is but one factor to consider when choosing a feature selection technique: one must also consider the technique's stability, that is, its ability to create feature subsets which remain valid in the face of changes to the data. While there has been work in determining the relative stability of different feature selection techniques, this does not always help determine whether a chosen feature selection technique will give stable feature subsets for a specific dataset. Factors such as difficulty of learning (e.g., dataset difficulty) may also influence feature selection stability, making generally-true facts about different techniques not applicable to a given dataset. In this work, we study how dataset difficulty can affect the stability of feature selection techniques, leading to good performance from bad techniques and vice versa. We use a set of twenty-six DNA microarray datasets with varying levels of difficulty of learning, along with four levels of dataset perturbation, six feature selection techniques with various levels of stability, and twelve feature subset sizes. The results show that as the dataset difficulty increases, the stability decreases. However, the relative stability between the techniques remains the same. Additionally, the more difficult the dataset, the more the stability is affected by changes to the data. We also found that unstable rankers are more affected by the transition between Easy and Moderate datasets, whereas the stable techniques are more affected by the change between Moderate and Hard datasets. Lastly, as the feature subset size increases, the stability increases and the difference between the levels of dataset difficulty decreases. Overall, we conclude that difficulty of learning must be taken into account before interpreting stability results.","PeriodicalId":418492,"journal":{"name":"2013 IEEE 14th International Conference on Information Reuse & Integration (IRI)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 14th International Conference on Information Reuse & Integration (IRI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IRI.2013.6642491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

Identifying important biomarkers to improve disease diagnosis and treatment is a significant topic of research in bioinformatics. However, bioinformatics datasets frequently have a large number of features per sample or instance. This problem, known as “high dimensionality,” can be alleviated through the use of dimension-reducing techniques such as feature (gene) selection, which removes unnecessary features. There are many versions of feature selection, with varying biases and predictive abilities. However, predictive power is but one factor to consider when choosing a feature selection technique: one must also consider the technique's stability, that is, its ability to create feature subsets that remain valid in the face of changes to the data. While there has been work on determining the relative stability of different feature selection techniques, this does not always help determine whether a chosen feature selection technique will give stable feature subsets for a specific dataset. Factors such as difficulty of learning (e.g., dataset difficulty) may also influence feature selection stability, making generally true facts about different techniques inapplicable to a given dataset. In this work, we study how dataset difficulty can affect the stability of feature selection techniques, leading to good performance from bad techniques and vice versa. We use a set of twenty-six DNA microarray datasets with varying levels of difficulty of learning, along with four levels of dataset perturbation, six feature selection techniques with various levels of stability, and twelve feature subset sizes. The results show that as the dataset difficulty increases, the stability decreases. However, the relative stability between the techniques remains the same. Additionally, the more difficult the dataset, the more the stability is affected by changes to the data. We also found that unstable rankers are more affected by the transition between Easy and Moderate datasets, whereas stable techniques are more affected by the transition between Moderate and Hard datasets. Lastly, as the feature subset size increases, the stability increases and the difference between the levels of dataset difficulty decreases. Overall, we conclude that difficulty of learning must be taken into account before interpreting stability results.
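The notion of stability used above can be made concrete. Below is a minimal sketch, not the authors' exact protocol: it stands in a univariate ANOVA F-score ranker (scikit-learn's `f_classif`) for the paper's six feature selection techniques, and random instance subsampling for its four perturbation levels, then scores agreement between the resulting top-k feature subsets with the Kuncheva consistency index, a common stability measure in this literature. All function names and parameter values here are illustrative assumptions.

```python
# Sketch: measuring feature selection stability under data perturbation.
# Assumptions: ANOVA F-score as the ranker, instance subsampling as the
# perturbation, and the Kuncheva consistency index as the stability score.
from itertools import combinations

import numpy as np
from sklearn.feature_selection import f_classif


def kuncheva_index(a, b, n_features):
    """Kuncheva consistency index between two equal-size feature subsets.

    Ranges from -1 to 1; corrects raw overlap for the overlap expected
    by chance when drawing k of n_features at random.
    """
    k = len(a)
    r = len(set(a) & set(b))  # features the two subsets share
    return (r * n_features - k * k) / (k * (n_features - k))


def stability(X, y, subset_size, n_perturbations=30, frac=0.9, seed=0):
    """Mean pairwise consistency of the top-k features across subsamples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    subsets = []
    for _ in range(n_perturbations):
        # Perturb the data by subsampling a fraction of the instances.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        scores, _ = f_classif(X[idx], y[idx])  # univariate filter ranker
        subsets.append(np.argsort(scores)[::-1][:subset_size])
    # Average the consistency index over all pairs of selected subsets.
    return np.mean([kuncheva_index(a, b, d) for a, b in combinations(subsets, 2)])
```

Sweeping `subset_size` over a range of values on datasets of differing difficulty would, if the paper's findings hold, show stability rising with subset size and falling as the dataset becomes harder to learn.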