Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?

2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) Pub Date : 2015-12-01 DOI:10.1109/ICMLA.2015.23

Alireza Fazelpour, T. Khoshgoftaar, D. Dittman, Amri Napolitano

{"title":"Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?","authors":"Alireza Fazelpour, T. Khoshgoftaar, D. Dittman, Amri Napolitano","doi":"10.1109/ICMLA.2015.23","DOIUrl":null,"url":null,"abstract":"Bioinformatics datasets contain many challenging characteristics, such as class imbalance, which adversely impacts the performance of supervised classification models built on these datasets. Techniques such as ensemble learning and data sampling from the domain of data mining can be deployed to alleviate the problem and to improve the classification performance. In this study, we sought to seek whether inclusion of data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two newly hybrid ensemble techniques, one integrates feature selection within the boosting process and the other incorporates random under-sampling followed by feature selection within the boosting framework, two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is statistically insignificant. Therefore, as the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study that examined the effects of data sampling, random under-sampling, to enhance classification performance of boosting algorithm for highly imbalanced bioinformatics data.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Bioinformatics datasets contain many challenging characteristics, such as class imbalance, which adversely impacts the performance of supervised classification models built on these datasets. Techniques such as ensemble learning and data sampling from the domain of data mining can be deployed to alleviate the problem and to improve the classification performance. In this study, we sought to seek whether inclusion of data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two newly hybrid ensemble techniques, one integrates feature selection within the boosting process and the other incorporates random under-sampling followed by feature selection within the boosting framework, two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is statistically insignificant. Therefore, as the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study that examined the effects of data sampling, random under-sampling, to enhance classification performance of boosting algorithm for highly imbalanced bioinformatics data.

查看原文本刊更多论文

数据采样的加入是否提高了非平衡生物信息学数据的增强算法的性能?

生物信息学数据集包含许多具有挑战性的特征，例如类不平衡，这对建立在这些数据集上的监督分类模型的性能产生了不利影响。集成学习和数据挖掘领域的数据采样等技术可以缓解这一问题并提高分类性能。在本研究中，我们试图寻求在集成框架中包含数据采样是否可以进一步提高分类模型的性能。为此，我们在15个高度不平衡的生物信息学数据集上使用了两种新的混合集成技术进行了实验研究，一种技术在增强过程中集成了特征选择，另一种技术在增强框架内集成了随机欠采样和特征选择，两种学习器、三种形式的特征排序器和四种特征子集大小。我们的结果和统计分析表明，两种提升方法之间的差异在统计上不显著。因此，由于数据采样的加入对集成分类器的性能没有显著的积极影响，所以不需要达到最大的分类性能。据我们所知，这是第一个检验数据采样、随机欠采样对增强算法对高度不平衡生物信息学数据的分类性能的影响的实证研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)

自引率

0.00%

发文量