Improving clustering-based and adaptive position-aware interpolation oversampling for imbalanced data classification

IF 5.2 · Zone 2 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Yujiang Wang, Marshima Mohd Rosli, Norzilah Musa, Lei Wang
Journal: Journal of King Saud University-Computer and Information Sciences, vol. 36, no. 10, Article 102253
Publication type: Journal Article
Published: 2024-12-01
DOI: 10.1016/j.jksuci.2024.102253
URL: https://www.sciencedirect.com/science/article/pii/S1319157824003422
Citations: 0

Abstract

Class imbalance is one of the most significant difficulties in modern machine learning, because standard classifiers are inherently biased toward majority instances and often ignore minority instances. Interpolation-based oversampling techniques, which generate synthetic minority samples to correct imbalanced class distributions, are among the most popular solutions. However, synthetic minority samples risk overlapping with majority-class samples, and inappropriate interpolation of minority samples during oversampling can also result in overgeneralization. To overcome these drawbacks, we propose a Clustering-based and Adaptive Position-aware Interpolation Oversampling algorithm (CAPAIO) for imbalanced binary dataset classification. CAPAIO first employs an improved density-based clustering algorithm to group minority instances into inland, borderline, and trapped samples. It then adaptively determines the size of each subcluster and allocates weights to minority samples, and these weights guide the synthesis of new minority samples. Finally, a distinct interpolation oversampling algorithm is applied to each of the three categories of minority samples. The experimental results demonstrate the effectiveness of the proposed CAPAIO on most datasets compared with eleven other oversampling algorithms.
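The two ideas named in the abstract, position-aware grouping of minority samples and interpolation between minority samples, can be illustrated with a minimal sketch. This is not CAPAIO itself: the abstract does not specify the improved density-based clustering, the subcluster sizing, or the weighting scheme, so the sketch substitutes a simple k-nearest-neighbour rule (in the spirit of Borderline-SMOTE) for the grouping and plain SMOTE-style linear interpolation for the synthesis. The function names, the thresholds, and the parameter k are assumptions made for illustration only.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def categorize_minority(X, y, minority=1, k=5):
    """Label each minority sample by its neighbourhood (hypothetical rule).

    'trapped'    : all k nearest neighbours belong to the majority class
    'borderline' : at least half of them do
    'inland'     : fewer than half of them do
    This stands in for the paper's density-based clustering, which the
    abstract does not describe in detail.
    """
    min_idx = np.flatnonzero(y == minority)
    d = pairwise_dist(X[min_idx], X)
    d[np.arange(len(min_idx)), min_idx] = np.inf  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    n_major = (y[nn] != minority).sum(axis=1)
    return np.where(n_major == k, "trapped",
                    np.where(2 * n_major >= k, "borderline", "inland"))

def interpolate_oversample(X_min, n_new, k=5, seed=0):
    """SMOTE-style synthesis: x_new = x + gap * (neighbour - x), gap in [0, 1).

    Requires at least two minority samples. Sampling is uniform here;
    CAPAIO instead guides synthesis with per-sample weights.
    """
    rng = np.random.default_rng(seed)
    d = pairwise_dist(X_min, X_min)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)   # seed sample per synthetic point
    nbr = nn[base, rng.integers(0, k, size=n_new)]   # one of its minority neighbours
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

Because each synthetic point is a convex combination of two minority samples, it always lies on the segment between them. That is exactly why a segment that crosses a majority region produces the class overlap the abstract warns about, and why treating inland, borderline, and trapped samples differently is a natural refinement.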
Source journal metrics: CiteScore 10.50; self-citation rate 8.70%; articles published: 656; review time: 29 days
About the journal: In 2022 the Journal of King Saud University - Computer and Information Sciences will become an author paid open access journal. Authors who submit their manuscript after October 31st 2021 will be asked to pay an Article Processing Charge (APC) after acceptance of their paper to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University Computer and Information Sciences is a refereed, international journal that covers all aspects of both foundations of computer and its practical applications.