A Hybrid Under-Sampling Method (HUSBoost) to Classify Imbalanced Data

Mahmudul Hasan Popel, Khan Md Hasib, Syed Ahsan Habib, Faisal Muhammad Shah
{"title":"A Hybrid Under-Sampling Method (HUSBoost) to Classify Imbalanced Data","authors":"Mahmudul Hasan Popel, Khan Md Hasib, Syed Ahsan Habib, Faisal Muhammad Shah","doi":"10.1109/ICCITECHN.2018.8631915","DOIUrl":null,"url":null,"abstract":"Imbalanced learning is the issue of learning from data when the class distribution is highly imbalanced. Class imbalance problems are seen increasingly in many domains and pose a challenge to traditional classification techniques. Learning from imbalanced data (two or more classes) creates additional complexities. Studies suggest that ensemble methods can produce more accurate results than regular Imbalance learning techniques (sampling and cost-sensitive learning). To deal with the problem, we propose a new hybrid under sampling based ensemble approach (HUSBoost) to handle imbalanced data which includes three basic steps- data cleaning, data balancing and classification steps. At first, we remove the noisy data using Tomek-Links. After that we create several balanced subsets by applying random under sampling (RUS) method to the majority class instances. These under sampled majority class instances and the minority class instances constitute the subsets of the imbalanced data-set. Having the same number of majority and minority class instances, they become balanced subsets of data. Then in each balanced subset, random forest (RF), AdaBoost with decision tree (CART) and AdaBoost with Support Vector Machine (SVM) are implemented in parallel where we use soft voting approach to get the combined result. From these ensemble classifiers we get the average result from all the balanced subsets. We also use 27 data-sets with different imbalanced ratio in order to verify the effectiveness of our proposed model and compare the experimental results of our model with RUSBoost and EasyEnsemble method.","PeriodicalId":355984,"journal":{"name":"2018 21st International Conference of Computer and Information Technology (ICCIT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 21st International Conference of Computer and Information Technology (ICCIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCITECHN.2018.8631915","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

Abstract

Imbalanced learning is the problem of learning from data whose class distribution is highly skewed. Class imbalance problems appear in a growing number of domains and pose a challenge to traditional classification techniques. Learning from imbalanced data (two or more classes) creates additional complexity. Studies suggest that ensemble methods can produce more accurate results than standard imbalanced learning techniques such as sampling and cost-sensitive learning. To deal with the problem, we propose a new hybrid under-sampling based ensemble approach (HUSBoost) to handle imbalanced data, consisting of three basic steps: data cleaning, data balancing, and classification. First, we remove noisy data using Tomek links. We then create several balanced subsets by applying random under-sampling (RUS) to the majority-class instances; each under-sampled set of majority-class instances, together with all minority-class instances, forms one subset of the imbalanced dataset. Because each subset contains the same number of majority- and minority-class instances, the subsets are balanced. On each balanced subset, a random forest (RF), AdaBoost with a decision tree (CART), and AdaBoost with a support vector machine (SVM) are trained in parallel, and their outputs are combined by soft voting. The final prediction averages these ensemble results over all balanced subsets. We evaluate the proposed model on 27 datasets with different imbalance ratios and compare its experimental results with the RUSBoost and EasyEnsemble methods.
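The three steps above map directly onto standard components from scikit-learn and imbalanced-learn. The sketch below is an illustrative reconstruction, not the authors' implementation: the helper name husboost_predict_proba, the number of subsets, and all base-learner hyperparameters are assumptions, since the abstract does not give the paper's settings.

```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def husboost_predict_proba(X_train, y_train, X_test, n_subsets=5):
    """HUSBoost-style prediction: clean, balance, classify, then average."""
    # Step 1 (data cleaning): drop majority-class points that form Tomek links,
    # i.e. cross-class nearest-neighbour pairs that are likely noisy/borderline.
    X_clean, y_clean = TomekLinks().fit_resample(X_train, y_train)

    proba_sum = None
    for seed in range(n_subsets):
        # Step 2 (data balancing): random under-sampling keeps all minority
        # instances and draws an equal-sized sample of majority instances.
        X_bal, y_bal = RandomUnderSampler(random_state=seed).fit_resample(X_clean, y_clean)

        # Step 3 (classification): RF, AdaBoost(CART) and AdaBoost(SVM),
        # combined by soft voting (averaged class probabilities).
        # Note: `estimator=` needs scikit-learn >= 1.2 (older: `base_estimator=`).
        ensemble = VotingClassifier(
            estimators=[
                ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
                ("ada_cart", AdaBoostClassifier(
                    estimator=DecisionTreeClassifier(max_depth=3),
                    n_estimators=50, random_state=seed)),
                ("ada_svm", AdaBoostClassifier(
                    estimator=SVC(probability=True),
                    n_estimators=10, random_state=seed)),
            ],
            voting="soft",
            n_jobs=-1,  # fit the three classifiers in parallel
        )
        ensemble.fit(X_bal, y_bal)

        proba = ensemble.predict_proba(X_test)
        proba_sum = proba if proba_sum is None else proba_sum + proba

    # Average the soft-voting output over all balanced subsets.
    return proba_sum / n_subsets
```

In this layout, the loop over random under-sampled subsets plays the role that bagging balanced subsets plays in EasyEnsemble, while soft voting within each subset realizes the parallel RF/AdaBoost combination the abstract describes; averaging the probabilities across subsets yields the final score.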