A framework to undersample and refine the synthetic minority set

IF 7.2 · Zone 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing · Pub Date: 2025-04-09 · DOI: 10.1016/j.asoc.2025.113095

Payel Sadhukhan

Volume 175, Article 113095 · Not open access · Full text: https://www.sciencedirect.com/science/article/pii/S1568494625004065

Citations: 0

Abstract

Oversampling the minority class is a popular strategy for coping with class imbalance in datasets. It improves recognition of the minority points to an acceptable extent. Nonetheless, the synthetic minority instances accentuate the overlap between the majority class and the augmented minority class, which is detrimental to the correct recognition of both classes. To this end, this paper introduces a novel strategy to undersample the synthetic minority set. A multi-armed bandit (MAB) guided protocol is followed to [i] identify the synthetic minority instances that contribute to the increased overlap between the two classes and [ii] subsequently remove (undersample) them iteratively to obtain a refined synthetic minority set. Simulation on synthetic datasets shows that the proposed strategy succeeds in increasing the Gromov–Wasserstein distance between the original majority class distribution and the synthetic minority points’ distribution (as compared to the regular oversampled data obtained through state-of-the-art techniques). Empirical evaluation on sixteen real-world datasets, with four state-of-the-art minority oversamplers and two refinement techniques, demonstrates the competence of the proposed strategy over baseline results and against the two competing methods. The proposed strategy improves the performance of the majority class without bringing down the minority class’s performance and can be incorporated in sensitive real-world domains.
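
The two-step idea in the abstract (identify the overlapping synthetic points, then remove them iteratively) can be illustrated with a toy sketch. The abstract does not say what the arms, the reward, or the selection rule are, so everything below is an assumption made for illustration and not the paper's protocol: each synthetic point is treated as one arm, a pull samples one of its k nearest real neighbours and pays reward 1 if that neighbour is a majority point, arms are chosen epsilon-greedily with optimistic initial estimates, and each round drops the single arm with the highest estimated reward. The function name `bandit_refine` and all of its parameters are hypothetical.

```python
import numpy as np


def bandit_refine(X_maj, X_min, X_syn, n_remove, pulls_per_round=200,
                  k=5, eps=0.2, seed=0):
    """Drop the n_remove synthetic points with the highest estimated overlap.

    Assumed design (not taken from the paper): one arm per synthetic point;
    reward 1 when a randomly sampled k-nearest real neighbour is majority.
    """
    rng = np.random.default_rng(seed)
    X_real = np.vstack([X_maj, X_min])
    is_maj = np.concatenate([np.ones(len(X_maj), bool),
                             np.zeros(len(X_min), bool)])
    # Indices of the k nearest real points for every synthetic point.
    dist = np.linalg.norm(X_syn[:, None, :] - X_real[None, :, :], axis=2)
    nn = np.argsort(dist, axis=1)[:, :k]

    alive = np.ones(len(X_syn), bool)   # synthetic points still kept
    pulls = np.zeros(len(X_syn))        # times each arm was pulled
    wins = np.zeros(len(X_syn))         # accumulated reward per arm

    for _ in range(n_remove):
        idx = np.flatnonzero(alive)
        for _ in range(pulls_per_round):
            # Optimistic estimate: unpulled arms look maximally suspicious.
            est = np.where(pulls[idx] > 0,
                           wins[idx] / np.maximum(pulls[idx], 1), 1.0)
            if rng.random() < eps:
                arm = rng.choice(idx)          # explore
            else:
                arm = idx[np.argmax(est)]      # exploit
            neighbour = nn[arm, rng.integers(k)]
            pulls[arm] += 1
            wins[arm] += is_maj[neighbour]
        # Remove the arm whose observed mean reward is highest.
        mean = wins[idx] / np.maximum(pulls[idx], 1)
        alive[idx[np.argmax(mean)]] = False
    return X_syn[alive], alive


# Toy data: the last 3 synthetic points are deliberately placed inside the
# majority cluster to mimic oversampler output that increases class overlap.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 0.3, size=(50, 2))
X_min = rng.normal(4.0, 0.3, size=(10, 2))
X_syn = np.vstack([rng.normal(4.0, 0.3, size=(8, 2)),
                   rng.normal(0.0, 0.3, size=(3, 2))])

kept, alive = bandit_refine(X_maj, X_min, X_syn, n_remove=3)
print(len(kept), "synthetic points kept")
```

In this toy setup `X_syn` stands in for the output of any oversampler. A real pipeline would also need a stopping rule for `n_remove`, which the abstract does not describe.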

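Source journal

Applied Soft Computing (Engineering & Technology; Computer Science, Interdisciplinary Applications)

CiteScore: 15.80
Self-citation rate: 6.90%
Articles published: 874
Review time: 10.9 months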

About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest-quality research in the application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the website is continuously updated with new articles and the publication time is short.