Dynamic Ensemble Framework for Imbalanced Data Classification

IF 8.9 | Zone 2, Computer Science | JCR Q1: Computer Science, Artificial Intelligence

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2025-01-13 DOI:10.1109/TKDE.2025.3528719

Tuanfei Zhu;Xingchen Hu;Xinwang Liu;En Zhu;Xinzhong Zhu;Huiying Xu

IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 5, pp. 2456-2471. https://ieeexplore.ieee.org/document/10839625/

Citations: 0

Abstract

Dynamic ensembles have considerably more room than static ensembles to improve the classification of imbalanced data. In practice, however, dynamic ensemble schemes have been far less successful than static ensemble methods in imbalanced learning. Through an in-depth analysis of the behavior of dynamic ensembles, we find that several important problems must be addressed to release their full potential, including but not limited to: correcting the component classifiers' bias towards the majority classes; increasing the proportion of positive classifiers (i.e., component classifiers that make the correct prediction) for difficult samples; and providing accurate competence estimates for hard-to-classify samples w.r.t. the classifier pool. Motivated by these findings, we propose a Dynamic Ensemble Framework for imbalanced data classification (imDEF). imDEF first uses the data generation method OREM$_{\mathrm{G}}$ to generate multiple synthetic datasets with diverse class distributions by rebalancing the original imbalanced data. On each synthetic dataset, imDEF then applies a Classification Error-aware Self-Paced Sampling Ensemble (SPSE$_{\mathrm{CE}}$) method that gradually focuses on difficult samples, in order to create a low-bias classifier pool and increase the proportion of positive classifiers for the difficult samples. Finally, imDEF constructs a referee system that performs the competence estimation using an Ensemble Margin-aware Self-Paced Sampling Ensemble (SPSE$_{\mathrm{EM}}$) method. SPSE$_{\mathrm{EM}}$ incrementally strengthens the learning of hard-to-classify samples, so that the competence levels of the component classifiers can be estimated accurately. Extensive experiments demonstrate the effectiveness of imDEF. The source code is publicly available on GitHub.
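To make the three stages described in the abstract more concrete, here is a minimal Python sketch of a generic dynamic ensemble pipeline on toy imbalanced data. It is not the authors' implementation. Every component is a simplified stand-in chosen for illustration: random undersampling at several ratios replaces OREM$_{\mathrm{G}}$, a plain k-NN classifier per rebalanced subset replaces SPSE$_{\mathrm{CE}}$, and a local-accuracy estimate on a validation set replaces the SPSE$_{\mathrm{EM}}$ referee system. All function names, the toy data, the ratios, and the tie-breaking rule are assumptions.

```python
import random
from collections import Counter


def make_data(rng, n_major, n_minor):
    """Toy 2-D data: class 0 (majority) near (0, 0), class 1 (minority) near (2, 2)."""
    pts = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(n_major)]
    pts += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(n_minor)]
    return pts


def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def knn_predict(train, x, k=5):
    """Majority label among the k training samples closest to x."""
    neighbours = sorted(train, key=lambda s: dist2(s[0], x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def rebalance(train, ratio, rng):
    """Stage 1 stand-in (NOT OREM_G): keep every minority sample and
    draw round(ratio * n_minor) majority samples at random."""
    minority = [s for s in train if s[1] == 1]
    majority = [s for s in train if s[1] == 0]
    n_keep = min(len(majority), round(ratio * len(minority)))
    return minority + rng.sample(majority, n_keep)


def build_pool(train, ratios, rng):
    """Stage 2 stand-in (NOT SPSE_CE): one k-NN classifier per class ratio,
    represented simply by the subset it is 'trained' on."""
    return [rebalance(train, r, rng) for r in ratios]


def des_predict(pool, val_preds, val, x, k=7):
    """Stage 3 stand-in (NOT the SPSE_EM referee): estimate each classifier's
    competence as its accuracy on the k validation samples nearest to x,
    then vote among the most competent classifiers.
    Assumption: ties in the vote go to the minority class (label 1)."""
    idx = sorted(range(len(val)), key=lambda i: dist2(val[i][0], x))[:k]
    competence = [sum(preds[i] == val[i][1] for i in idx) / k for preds in val_preds]
    best = max(competence)
    chosen = [subset for subset, c in zip(pool, competence) if c == best]
    votes = [knn_predict(subset, x) for subset in chosen]
    return 1 if 2 * sum(votes) >= len(votes) else 0


rng = random.Random(0)
train = make_data(rng, 200, 20)   # 10:1 imbalance
val = make_data(rng, 100, 10)     # held out for competence estimation
test = make_data(rng, 100, 10)

ratios = [1.0, 1.5, 2.0, 3.0, 5.0]  # majority:minority ratio of each subset
pool = build_pool(train, ratios, rng)

# Cache each pool member's predictions on the validation set once.
val_preds = [[knn_predict(subset, x) for x, _ in val] for subset in pool]

y_pred = [des_predict(pool, val_preds, val, x) for x, _ in test]
n_minor_test = sum(y for _, y in test)
minority_hits = sum(p == 1 and y == 1 for p, (_, y) in zip(y_pred, test))
print(f"minority recall: {minority_hits}/{n_minor_test}")
```

The sketch only shows where each stage sits in the pipeline: the diversity of class distributions comes from the ratios, and the per-sample selection comes from the local competence estimate. The self-paced sampling that the abstract credits with reducing bias and sharpening the competence estimates on hard samples is deliberately left out.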

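Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 11.70

Self-citation rate: 3.40%

Articles published: 515

Review time: 6 months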

About the journal: The IEEE Transactions on Knowledge and Data Engineering covers knowledge and data engineering within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.