A novel overlapping minimization SMOTE algorithm for imbalanced classification

IF 2.9 | Zone 3, Engineering Technology | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers of Information Technology & Electronic Engineering Pub Date : 2024-09-05 DOI:10.1631/fitee.2300278

Yulin He, Xuan Lu, Philippe Fournier-Viger, Joshua Zhexue Huang


Citations: 0

Abstract


A novel overlapping minimization SMOTE algorithm for imbalanced classification

The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and it has received several enhancements over the past 20 years. SMOTE and its variants synthesize minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between different classes, which further complicates classifier training. To address this issue, this paper proposes a novel minority-class sample point generation algorithm that is generalization-oriented rather than imputation-oriented, named overlapping minimization SMOTE (OM-SMOTE). The algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. It then applies a set of minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from overlapping areas between classes. Experiments on 32 imbalanced datasets validate the effectiveness of OM-SMOTE. The results show that generating synthetic minority-class sample points with OM-SMOTE leads to better training performance for naive Bayes, support vector machine, decision tree, and logistic regression classifiers than 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is publicly available on GitHub at https://github.com/luxuan123123/OM-SMOTE/.
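To make the overlap problem concrete, the following is a minimal sketch of the classic SMOTE interpolation step that the abstract takes as its starting point. It is not OM-SMOTE: the abstract does not specify the mapping or the imputation rules, so none of that is reproduced here. The function names `smote` and `overlap_fraction`, the toy data, and the nearest-point overlap proxy are all assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Baseline SMOTE sketch: interpolate between a minority point and
    one of its k nearest minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # uniform in [0, 1)
        # The new point lies on the segment from base to neighbor.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


def overlap_fraction(synthetic, minority, majority):
    """Hypothetical overlap proxy (not from the paper): the share of
    synthetic points whose nearest original point is a majority point."""
    def nearest(p, pts):
        return min(math.dist(p, q) for q in pts)

    hits = sum(
        1 for s in synthetic if nearest(s, majority) < nearest(s, minority)
    )
    return hits / len(synthetic)


# Toy 2-D data (assumed): one minority point sits beyond the majority
# cluster, so segments toward it cross the majority region.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
majority = [(1.5, 1.5), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (2.5, 1.5)]

new_points = smote(minority, n_synthetic=20, k=3, seed=42)
print(len(new_points), overlap_fraction(new_points, minority, majority))
```

Because plain SMOTE interpolates along straight segments in the original sample space, any segment that crosses the majority region can place synthetic points inside it. That is the situation the abstract says OM-SMOTE is designed to avoid.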


Source journal

Frontiers of Information Technology & Electronic Engineering
Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

6.00

Self-citation rate

10.00%

Articles published

1372

Journal description: Frontiers of Information Technology & Electronic Engineering (ISSN 2095-9184, monthly), formerly known as Journal of Zhejiang University SCIENCE C (Computers & Electronics) (2010-2014), is an international peer-reviewed journal launched by the Chinese Academy of Engineering (CAE) and Zhejiang University, and co-published by Springer and Zhejiang University Press. FITEE aims to publish the latest applications, principles, and algorithms in the broad area of Electrical and Electronic Engineering, including but not limited to Computer Science, Information Sciences, Control, Automation, and Telecommunications. It accepts several article types, including research articles, review articles, science letters, perspectives, and new technical notes and methods.