GDG: An evolutionary oversampling framework integrating Gaussian mixture modeling and genetic algorithm

IF 8.5 · Zone 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Swarm and Evolutionary Computation · Pub Date: 2026-04-01 · Epub Date: 2026-04-06 · DOI: 10.1016/j.swevo.2026.102375

Yelin Zhang, Dongmei Wang, Yuehua Yu, Chen Chen, Chengwang Xie

Volume 104, Article 102375 · Journal Article · Not open access · https://www.sciencedirect.com/science/article/pii/S2210650226000957

Citations: 0

Abstract

Class imbalance induces significant bias in machine learning classifiers. While oversampling mitigates this, a critical knowledge gap exists: conventional generative methods often assume unimodal distributions and fail to address complex boundary overlap, leading to noisy, low-fidelity synthetic samples. To bridge this gap, we propose GDG, a novel framework integrating Gaussian Mixture Model (GMM) and Genetic Algorithm (GA). First, GMM clusters minority samples to accurately capture intrinsic multi-modal structures. Subsequently, an innovative global–local mechanism adaptively allocates synthetic samples based on boundary complexity, effectively minimizing overlap. Lastly, the GA performs a nonlinear search within superspheres, utilizing adaptive fitness weights to balance exploration and exploitation for high-quality generation. Extensive experiments on 21 benchmark datasets demonstrate that GDG significantly outperforms nine state-of-the-art baselines, improving average Accuracy by 1.9%, G-mean by 6.0%, and AUC by 1.2%. Rigorous non-parametric statistical analysis confirms these differences ($p = 1.78 \times 10^{-7}$), with post-hoc Nemenyi testing verifying that GDG achieves the superior average rank of 2.17. These findings establish GDG as a robust, statistically validated solution for tackling complex class imbalance problems.
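The abstract does not spell out the GA stage, so the following is only a minimal sketch of what "a search within a sphere with adaptive fitness weights" could look like, not the authors' implementation. Everything here is an assumption made for illustration: the function names, the fitness terms (distance to the nearest majority sample and distance from the sphere centre), the linear weight schedule, and the choice of arithmetic crossover with Gaussian mutation. The sphere centre stands in for a minority cluster centre such as a GMM component mean, and the majority sample list is assumed to be non-empty.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_point_in_sphere(center, radius, rng):
    """Draw a point uniformly from the ball around `center`."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    r = radius * rng.random() ** (1.0 / len(center))
    return [c + r * d / norm for c, d in zip(center, direction)]


def project_into_sphere(point, center, radius):
    """Pull a point back onto the sphere surface if it left the ball."""
    d = dist(point, center)
    if d <= radius:
        return point
    scale = radius / d
    return [c + (p - c) * scale for p, c in zip(point, center)]


def fitness(point, center, majority, w):
    """Hypothetical fitness: weighted sum of 'safety' and 'spread'."""
    safety = min(dist(point, m) for m in majority)  # stay away from majority
    spread = dist(point, center)                    # explore away from centre
    return w * safety + (1.0 - w) * spread


def evolve_samples(center, radius, majority, n_out,
                   pop_size=30, n_gen=20, seed=0):
    """Evolve candidate synthetic samples inside one sphere (sketch only)."""
    rng = random.Random(seed)
    pop = [random_point_in_sphere(center, radius, rng) for _ in range(pop_size)]
    w = 0.2
    for gen in range(n_gen):
        # Assumed schedule: shift weight from spread to safety over time.
        w = 0.2 + 0.6 * gen / max(1, n_gen - 1)
        pop.sort(key=lambda p: fitness(p, center, majority, w), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            alpha = rng.random()
            # Arithmetic crossover followed by Gaussian mutation.
            child = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
            child = [x + rng.gauss(0.0, 0.1 * radius) for x in child]
            children.append(project_into_sphere(child, center, radius))
        pop = parents + children
    pop.sort(key=lambda p: fitness(p, center, majority, w), reverse=True)
    return pop[:n_out]


if __name__ == "__main__":
    minority_center = [0.0, 0.0]
    majority_points = [[1.0, 0.0], [0.9, 0.3]]
    for sample in evolve_samples(minority_center, 0.5, majority_points, n_out=5):
        print([round(v, 3) for v in sample])
```

Projecting offspring back into the sphere keeps every candidate within a fixed radius of the minority cluster, which is one simple way to bound how far synthetic samples can drift toward the majority class. The paper's actual operators, fitness terms, and weight adaptation may differ.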


Source journal

Swarm and Evolutionary Computation (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, THEORY & METHODS)

CiteScore: 16.00

Self-citation rate: 12.00%

Articles published: 169

About the journal: Swarm and Evolutionary Computation is a pioneering peer-reviewed journal focused on the latest research and advancements in nature-inspired intelligent computation using swarm and evolutionary algorithms. It covers theoretical, experimental, and practical aspects of these paradigms and their hybrids, promoting interdisciplinary research. The journal prioritizes the publication of high-quality, original articles that push the boundaries of evolutionary computation and swarm intelligence. Additionally, it welcomes survey papers on current topics and novel applications. Topics of interest include but are not limited to: Genetic Algorithms and Genetic Programming, Evolution Strategies and Evolutionary Programming, Differential Evolution, Artificial Immune Systems, Particle Swarms, Ant Colony, Bacterial Foraging, Artificial Bees, Fireflies Algorithm, Harmony Search, Artificial Life, Digital Organisms, Estimation of Distribution Algorithms, Stochastic Diffusion Search, Quantum Computing, Nano Computing, Membrane Computing, Human-centric Computing, Hybridization of Algorithms, Memetic Computing, Autonomic Computing, Self-organizing Systems, and Combinatorial, Discrete, Binary, Constrained, Multi-objective, Multi-modal, Dynamic, and Large-scale Optimization.