The use of balance-aware subsampling for bioinformatics datasets

2013 IEEE 14th International Conference on Information Reuse & Integration (IRI) Pub Date : 2013-10-24 DOI:10.1109/IRI.2013.6642489

Randall Wald, T. Khoshgoftaar, Alireza Fazelpour


Citations: 2

Abstract

A major challenge facing data-mining practitioners in the field of bioinformatics is class imbalance, which occurs when instances of one class (called the majority class) vastly outnumber instances of the other (minority) classes. This can result in models with increased bias towards the majority class (minority-class instances predicted as being in the majority class). Data sampling, a process which changes the dataset through removing or adding instances to improve the class balance, can be used to improve the performance of such models on imbalanced data. However, it is not clear what target balance level should be used with data sampling, and what influence class imbalance alone has on classification performance (compared to other issues such as difficulty of learning from the data and dataset size). To resolve this, we propose the Balance-Aware Subsampling technique, which allows researchers to directly compare different balance levels of a dataset while keeping all other factors (such as dataset size and the actual dataset in question) constant. Thus, any changes in performance can be attributed solely to the chosen balance level. We demonstrate this technique using six datasets from the field of bioinformatics, and we also consider three different subsample sizes (that is, the size of the dataset used for building a model) so we can observe the effect of this parameter on classification performance. Our results show that within each level of class imbalance, the average AUC value increases as the subsample size increases. The key exception is the 20:80 (minority:majority) balance level, for which the average AUC value decreases as the subsample size increases from 80 to 120. We also find that within each subsample size, the average AUC value increases as the minority distribution increases, although this does not completely hold for subsample size 40 (in which case, the Naïve Bayes and Random Forest learners show greater performance at the 35:65 balance level than at 50:50), and in general there is not a significant improvement between the 35:65 and 50:50 balance levels. Overall, by using Balance-Aware Subsampling, we are able to directly observe how class imbalance affects performance isolated from all other factors.
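The core idea in the abstract, drawing subsamples of a fixed size at a chosen minority:majority ratio from one dataset, can be illustrated with a short Python sketch. This is not the authors' implementation: the abstract does not give the exact procedure, so the function name `balance_aware_subsample`, sampling without replacement, the rounding rule, and the toy label vector are all assumptions made for illustration. Only the subsample sizes (40, 80, 120) and balance levels (20:80, 35:65, 50:50) are taken from the abstract.

```python
import random
from collections import Counter


def balance_aware_subsample(labels, size, minority_fraction, minority_label, seed=0):
    """Return indices of a subsample with a fixed size and a chosen class balance.

    Hypothetical helper: draws without replacement from each class so that
    the subsample size stays constant while only the balance level changes.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    # Number of instances to draw from each class (rounding is an assumption).
    n_min = round(size * minority_fraction)
    n_maj = size - n_min
    if n_min > len(minority_idx) or n_maj > len(majority_idx):
        raise ValueError("not enough instances for the requested size and balance")

    chosen = rng.sample(minority_idx, n_min) + rng.sample(majority_idx, n_maj)
    rng.shuffle(chosen)
    return chosen


# Toy label vector (made up): 70 minority (1) and 130 majority (0) instances.
labels = [1] * 70 + [0] * 130

# Same dataset, every combination of subsample size and balance level.
for size in (40, 80, 120):
    for frac in (0.20, 0.35, 0.50):
        idx = balance_aware_subsample(labels, size, frac, minority_label=1)
        counts = Counter(labels[i] for i in idx)
        print(size, frac, dict(counts))
```

Because each subsample comes from the same source data and has the same size within a row of the grid, a learner trained on each one could be compared across balance levels with dataset size held fixed. Repeating the draw with different seeds and averaging a metric such as AUC would be a natural extension, though how the paper repeats and averages its runs is not stated in the abstract.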


