Title: Selective ensemble learning algorithm for imbalanced dataset
Authors: Hongle Du, Yan Zhang, Lin Zhang, Yeh-Cheng Chen
Journal: Computer Science and Information Systems, volume 8 1, pages 831-856 (JCR Q4, impact factor 1.2)
Published: 2023-01-01
DOI: 10.2298/csis220817023d (https://doi.org/10.2298/csis220817023d)
Citation count: 0
Abstract
On an imbalanced dataset, the performance of an ensemble classifier depends heavily on three factors: the performance of the base-classifiers, the method used to compute their weights, and the method used to select them. To address these problems and improve the generalization performance of the ensemble classifier, a selective ensemble learning algorithm based on under-sampling is proposed for imbalanced datasets. First, the proposed algorithm calculates the number K of under-sampled samples according to the relationship between the sample densities of the classes. Next, we use an improved K-means clustering algorithm to under-sample the majority class and obtain K cluster centers. All cluster centers (or the samples nearest to the cluster centers) are then treated as the new majority samples and combined with the minority class's samples to construct a new balanced training subset. These steps are repeated to generate multiple training subsets and train multiple base-classifiers. However, as the number of iterations grows, the number of base-classifiers increases, and so does the similarity among them. It is therefore necessary to select, for the ensemble, base-classifiers that both classify well and differ substantially from one another. In the selection stage, we choose base-classifiers according to their difference and performance, using the idea of maximum correlation and minimum redundancy. In the ensemble stage, G-mean or F-mean is used to evaluate the classification performance of each base-classifier on the imbalanced dataset, and this score is used to compute the weight of each base-classifier. The base-classifiers are then combined by weighted voting. Finally, simulation results on the artificial dataset, UCI dataset and KDDCUP dataset show that the algorithm has good generalization performance on imbalanced datasets, especially on datasets with a high degree of imbalance.
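The ensemble stage described above can be illustrated with a minimal Python sketch. It assumes a binary problem in which label 1 is the minority class, takes G-mean in its usual sense (the geometric mean of the true positive rate and the true negative rate), and uses the raw G-mean directly as the voting weight. The abstract does not give the exact weight formula, so that choice, the tie-breaking rule, and the names `g_mean` and `weighted_vote` are assumptions made for illustration, not the authors' implementation.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of true positive rate and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


def weighted_vote(predictions, weights, positive=1, negative=0):
    """Combine per-classifier label lists by weighted voting.

    predictions: one list of predicted labels per base-classifier.
    weights: one non-negative weight per base-classifier.
    Ties go to the negative (majority) class; this is an assumption.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        pos = sum(w for preds, w in zip(predictions, weights) if preds[i] == positive)
        neg = sum(w for preds, w in zip(predictions, weights) if preds[i] != positive)
        combined.append(positive if pos > neg else negative)
    return combined


# Toy validation labels: two minority samples (1) and four majority samples (0).
y_val = [1, 1, 0, 0, 0, 0]
base_predictions = [
    [1, 1, 0, 0, 0, 0],  # finds both minority samples
    [1, 0, 0, 0, 0, 0],  # finds one minority sample
    [0, 0, 0, 0, 0, 0],  # always predicts the majority class
]

# Weight each base-classifier by its G-mean on the validation labels.
weights = [g_mean(y_val, preds) for preds in base_predictions]
ensemble = weighted_vote(base_predictions, weights)
print(weights)
print(ensemble)
```

In this toy case the third classifier has an accuracy of 4/6 but a G-mean of 0, so it contributes nothing to the vote. This is the reason a measure such as G-mean is preferred over plain accuracy when weighting base-classifiers on imbalanced data.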
Journal description:
Aims and scope
Computer Science and Information Systems (ComSIS) is an international refereed journal, published in Serbia. The objective of ComSIS is to communicate important research and development results in the areas of computer science, software engineering, and information systems.