Classification performance of three approaches for combining data sampling and gene selection on bioinformatics data

Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014). Pub Date: 2014-08-01. DOI: 10.1109/IRI.2014.7051906

T. Khoshgoftaar, Alireza Fazelpour, D. Dittman, Amri Napolitano


Citations: 9

Abstract

Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle class imbalance, and feature selection can alleviate the problem of excessive features. In this work, we examine various approaches for applying these two techniques together to address both challenges and build effective classification models. In particular, we ask whether the order of the techniques, and the use of unsampled or sampled datasets for building classification models, makes a difference. We conducted an empirical study on seven high-dimensional and severely imbalanced biological datasets, using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three approaches: data sampling followed by feature selection, with the model built on the unsampled data and the selected features (DS-FS-UnSam); data sampling followed by feature selection, with the model built on the sampled data and the selected features (DS-FS-Sam); and feature selection followed by data sampling, with the model built on the sampled data and the selected features (FS-DS). We used Random Undersampling (RUS) to achieve minority:majority class ratios of 35:65 and 50:50.

The experimental results show statistically significant differences among the three approaches only at the 50:50 class ratio, where a multiple comparison test shows that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend DS-FS-UnSam for this class ratio. With the 35:65 class ratio, on the other hand, DS-FS-Sam was most frequently the top-performing approach, and although it was not statistically significantly better than the other approaches, we would generally recommend it for the 35:65 class ratio (although the best approach for specific choices of learner and ranker may vary). Overall, the optimal approach depends on the choice of class ratio.
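The three approaches differ only in the order of the two steps and in which copy of the data is handed to the learner. The sketch below illustrates that difference under our reading of the abstract; it is not the authors' implementation. The function names, the toy dataset, and the mean-difference ranker are all assumptions made for illustration (the paper's four rankers and six learners are not reproduced here), and label 1 is assumed to be the minority class.

```python
import random

def toy_dataset(n_minority=10, n_majority=90, n_features=5, seed=1):
    """Hypothetical imbalanced data: feature 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    rows = []
    for label, count in ((1, n_minority), (0, n_majority)):
        for _ in range(count):
            x = [5.0 * label] + [rng.random() for _ in range(n_features - 1)]
            rows.append((x, label))
    return rows

def random_undersample(rows, minority_share, seed=0):
    """RUS: randomly drop majority rows until the minority class reaches the target share."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    # minority / (minority + kept) = share  =>  kept = minority * (1 - share) / share
    keep = round(len(minority) * (1 - minority_share) / minority_share)
    keep = min(keep, len(majority))
    return minority + rng.sample(majority, keep)

def rank_features(rows, k):
    """Stand-in ranker (an assumption): score = |mean(minority) - mean(majority)|."""
    def mean(values):
        return sum(values) / len(values)
    n_features = len(rows[0][0])
    scores = []
    for j in range(n_features):
        pos = [x[j] for x, y in rows if y == 1]
        neg = [x[j] for x, y in rows if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def project(rows, features):
    """Keep only the selected feature columns."""
    return [([x[j] for j in features], y) for x, y in rows]

def ds_fs_unsam(rows, k, share, seed=0):
    """Sample, rank on the sampled data, then train on the UNSAMPLED data."""
    sampled = random_undersample(rows, share, seed)
    return project(rows, rank_features(sampled, k))

def ds_fs_sam(rows, k, share, seed=0):
    """Sample, rank on the sampled data, then train on the SAMPLED data."""
    sampled = random_undersample(rows, share, seed)
    return project(sampled, rank_features(sampled, k))

def fs_ds(rows, k, share, seed=0):
    """Rank on the full data first, then sample the reduced data for training."""
    reduced = project(rows, rank_features(rows, k))
    return random_undersample(reduced, share, seed)
```

Each function returns the training set that would be passed to a learner. With 10 minority and 90 majority instances, a 50:50 target keeps 10 majority rows and a 35:65 target keeps round(10 × 0.65 / 0.35) = 19, so the achieved ratio is only approximately 35:65 when the counts do not divide evenly.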

