{"title":"A Novel Expandable Borderline Smote Over-Sampling Method for Class Imbalance Problem","authors":"Hao Sun;Jianping Li;Xiaoqian Zhu","doi":"10.1109/TKDE.2025.3544284","DOIUrl":null,"url":null,"abstract":"The class imbalance problem can cause classifiers to be biased toward the majority class and inclined to generate incorrect predictions. While existing studies have proposed numerous oversampling methods to alleviate class imbalance by generating extra minority class samples, these methods still have some inherent weaknesses and make the generated samples less informative. This study proposes a novel over-sampling method named the Expandable Borderline Smote (EB-Smote), which can address the weaknesses of existing over-sampling methods and generate more informative synthetic samples. In EB-Smote, not only minority class but also majority class is oversampled, and the synthetic samples are generated in the area between the selected minority and majority samples, which are close to the borderlines of their respective classes. EB-Smote can generate more informative samples by expanding the borderlines of minority and majority classes toward the actual decision boundary. Based on 27 imbalanced datasets and commonly used machine learning models, the experimental results demonstrate that EB-Smote significantly outperforms the other 8 existing oversampling methods. This study can provide theoretical guidance and practical recommendations to solve the crucial class imbalance problem in classification tasks.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2183-2199"},"PeriodicalIF":8.9000,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Knowledge and Data Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10902063/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
The class imbalance problem can cause classifiers to be biased toward the majority class and inclined to generate incorrect predictions. While existing studies have proposed numerous oversampling methods to alleviate class imbalance by generating extra minority class samples, these methods still have some inherent weaknesses and make the generated samples less informative. This study proposes a novel over-sampling method named the Expandable Borderline Smote (EB-Smote), which can address the weaknesses of existing over-sampling methods and generate more informative synthetic samples. In EB-Smote, not only minority class but also majority class is oversampled, and the synthetic samples are generated in the area between the selected minority and majority samples, which are close to the borderlines of their respective classes. EB-Smote can generate more informative samples by expanding the borderlines of minority and majority classes toward the actual decision boundary. Based on 27 imbalanced datasets and commonly used machine learning models, the experimental results demonstrate that EB-Smote significantly outperforms the other 8 existing oversampling methods. This study can provide theoretical guidance and practical recommendations to solve the crucial class imbalance problem in classification tasks.
期刊介绍:
The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.