A Hybrid Under-Sampling Method (HUSBoost) to Classify Imbalanced Data
Mahmudul Hasan Popel, Khan Md Hasib, Syed Ahsan Habib, Faisal Muhammad Shah
2018 21st International Conference of Computer and Information Technology (ICCIT), December 2018
DOI: 10.1109/ICCITECHN.2018.8631915
Citations: 17
Abstract
Imbalanced learning is the problem of learning from data whose class distribution is highly skewed. Class imbalance arises increasingly in many domains and poses a challenge to traditional classification techniques, and learning from imbalanced data (with two or more classes) creates additional complexity. Studies suggest that ensemble methods can produce more accurate results than standard imbalance-learning techniques such as sampling and cost-sensitive learning. To address the problem, we propose a new hybrid under-sampling-based ensemble approach (HUSBoost) for imbalanced data, which comprises three basic steps: data cleaning, data balancing, and classification. First, we remove noisy data using Tomek links. We then create several balanced subsets by applying random under-sampling (RUS) to the majority-class instances. These under-sampled majority-class instances, together with the minority-class instances, form the subsets of the imbalanced dataset; having equal numbers of majority- and minority-class instances, they are balanced. In each balanced subset, a random forest (RF), AdaBoost with a decision tree (CART), and AdaBoost with a support vector machine (SVM) are then run in parallel, and their outputs are combined by soft voting. The results of these ensemble classifiers are averaged over all the balanced subsets. We also use 27 datasets with different imbalance ratios to verify the effectiveness of the proposed model, comparing its experimental results with the RUSBoost and EasyEnsemble methods.