The incremental SMOTE: A new approach based on the incremental k-means algorithm for solving imbalanced data set problem

IF 8.1 · Zone 1, Computer Science · COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Sciences Pub Date : 2025-03-19 DOI:10.1016/j.ins.2025.122103

Duygu Selin Turan, Burak Ordin

Information Sciences, volume 711, Article 122103. Journal Article. https://www.sciencedirect.com/science/article/pii/S002002552500235X

Citations: 0

Abstract

Classification is one of the most important areas of data mining. Classification methods generally perform well on datasets whose class distribution is balanced. Real-world datasets, however, are often imbalanced, so algorithms that solve the classification problem on imbalanced data are needed. Imbalanced datasets are harder to classify than balanced ones because an underrepresented class is difficult to learn. The class with the fewest samples usually corresponds to rare cases and is the more important one, so learning it is critical. One of the most commonly used remedies is to oversample the minority class. When oversampling, too many repetitions in the dataset can cause overfitting, so it is very important to ensure data diversity. This paper therefore proposes a new oversampling method, the incremental SMOTE, which combines the incremental k-means algorithm with the synthetic minority oversampling technique (SMOTE). The original dataset is clustered with the incremental k-means algorithm, and the clusters are filtered to identify the safe clusters. The number of points to be generated from the safe clusters is determined, and new instances are then produced with the improved SMOTE algorithm. In the incremental SMOTE, diversity in the dataset is achieved by generating instances at an incremental rate. To evaluate the incremental SMOTE algorithm, classification was performed on the imbalanced datasets and on balanced datasets obtained with random oversampling, SMOTE, Borderline-SMOTE, and SVM SMOTE. Comparisons on 10 datasets showed that the performance of the proposed method improves as the imbalance ratio of the dataset increases.
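The cluster, filter, allocate, and generate steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the incremental k-means procedure, the filtering rule, the "improved" SMOTE, or the incremental rate. The sketch therefore substitutes plain Lloyd's k-means, an assumed "safe cluster" rule (a hypothetical `min_share` threshold on the minority fraction of a cluster), allocation proportional to cluster size, and classic SMOTE interpolation. All function and parameter names are illustrative, and a binary class problem is assumed.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means, used here as a stand-in for incremental k-means."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def smote_point(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    lam = rng.random()
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbor))


def cluster_smote(X, y, minority, k=2, n_neighbors=3, min_share=0.5, seed=0):
    """Return synthetic minority points; X is a list of tuples, y a list of labels."""
    rng = random.Random(seed)
    labels = kmeans(X, k, rng)

    # Assumed filter: a cluster is "safe" if it holds at least two minority
    # points and its minority share is at least min_share.
    safe = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        mins = [X[i] for i in idx if y[i] == minority]
        if len(mins) >= 2 and len(mins) / len(idx) >= min_share:
            safe.append(mins)

    # Points needed to balance a binary problem: majority count minus minority count.
    n_min = sum(1 for label in y if label == minority)
    n_needed = len(y) - 2 * n_min
    if not safe or n_needed <= 0:
        return []

    total = sum(len(mins) for mins in safe)
    synthetic = []
    for mins in safe:
        # Assumed allocation: proportional to the cluster's minority count
        # (rounding means the total can differ slightly from n_needed).
        quota = round(n_needed * len(mins) / total)
        for _ in range(quota):
            j = rng.randrange(len(mins))
            others = mins[:j] + mins[j + 1:]
            # Interpolate toward one of the nearest minority neighbors in the cluster.
            nearest = sorted(others, key=lambda p: math.dist(mins[j], p))[:n_neighbors]
            synthetic.append(smote_point(mins[j], rng.choice(nearest), rng))
    return synthetic
```

Calling `cluster_smote(X, y, minority=1)` on a list of feature tuples `X` with labels `y` returns only the new points; appending them to `X` with the minority label gives the rebalanced dataset. Because interpolation happens only inside clusters that pass the filter, the new points stay within regions where the minority class is already well represented, which is the intuition behind restricting generation to safe clusters.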


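Source journal

Information Sciences (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 14.00

Self-citation rate: 17.30%

Articles published: 1322

Review time: 10.4 months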

About the journal: Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and survey contributions. Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up to date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.