Selective ensemble learning algorithm for imbalanced dataset

IF 1.2 · Region 4 (Computer Science) · JCR Q4 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Computer Science and Information Systems · Pub Date: 2023-01-01 · DOI: 10.2298/csis220817023d

Hongle Du, Yan Zhang, Lin Zhang, Yeh-Cheng Chen

Pages: 831-856 · Publication type: Journal Article

Citations: 0

Abstract



On an imbalanced dataset, the performance of an ensemble classifier depends heavily on three factors: the performance of the base classifiers, the method used to compute each base classifier's weight, and the method used to select base classifiers. To address these problems and improve the generalization performance of the ensemble classifier, a selective ensemble learning algorithm based on under-sampling for imbalanced datasets is proposed. First, the algorithm calculates the number K of under-sampled samples according to the relationship between the class sample densities. Next, an improved K-means clustering algorithm under-samples the majority class and yields K cluster centers. All cluster centers (or the samples nearest to each cluster center) are then treated as the new majority samples and combined with the minority class's samples to construct a new balanced training subset. These steps are repeated to generate multiple training subsets and train multiple base classifiers. However, as the number of iterations increases, the number of base classifiers grows, and so does the similarity among them. It is therefore necessary to select base classifiers that combine good classification performance with large differences from one another. In the selection stage, base classifiers are chosen according to their diversity and performance, following the idea of maximum correlation and minimum redundancy. In the ensemble stage, G-mean or F-mean is used to evaluate each base classifier's classification performance on the imbalanced dataset, and that value is used to compute the classifier's weight; the base classifiers are then combined by weighted voting. Finally, simulation results on artificial, UCI, and KDDCUP datasets show that the algorithm generalizes well on imbalanced datasets, especially highly imbalanced ones.
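The ensemble stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weight formula, so the sketch assumes each base classifier's weight is simply its G-mean on a validation set, and that ties in the vote go to the majority class. The function names `g_mean` and `weighted_vote` and the toy labels are hypothetical, chosen only for illustration.

```python
from math import sqrt


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall on the minority/positive class)
    and specificity (recall on the majority/negative class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sqrt(sensitivity * specificity)


def weighted_vote(predictions, weights, positive=1, negative=0):
    """Combine per-classifier label lists by weighted voting.
    Assumption: a tie is resolved in favour of the negative (majority) class."""
    combined = []
    for i in range(len(predictions[0])):
        pos = sum(w for preds, w in zip(predictions, weights) if preds[i] == positive)
        neg = sum(w for preds, w in zip(predictions, weights) if preds[i] != positive)
        combined.append(positive if pos > neg else negative)
    return combined


# Toy validation labels: 2 minority (1) and 6 majority (0) samples.
y_val = [1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical outputs of three base classifiers on the same samples.
preds = [
    [1, 1, 0, 0, 0, 0, 1, 1],  # finds both minority samples, two false alarms
    [1, 0, 0, 0, 0, 0, 0, 0],  # misses one minority sample, no false alarms
    [0, 0, 0, 0, 0, 0, 0, 0],  # always predicts the majority class
]

# Assumed weighting rule: weight = G-mean on the validation set.
weights = [g_mean(y_val, p) for p in preds]

combined = weighted_vote(preds, weights)
print(weights)
print(combined)  # -> [1, 1, 0, 0, 0, 0, 1, 1]
```

The third classifier is 75% accurate on this toy data, yet its G-mean is 0 because it never detects the minority class, so it contributes nothing to the vote. This is the reason an imbalance-aware measure such as G-mean is preferable to plain accuracy when weighting base classifiers.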


Source journal

Computer Science and Information Systems (categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore: 2.30

Self-citation rate: 21.40%

Articles published: (not shown)

Review time: 7.5 months

About the journal: Computer Science and Information Systems (ComSIS) is an international refereed journal, published in Serbia. The objective of ComSIS is to communicate important research and development results in the areas of computer science, software engineering, and information systems.