Improving classification on imbalanced genomic data via KDE-based synthetic sampling.

IF 6.1 | Zone 3 (Biology) | Q1 | MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining | Pub Date: 2025-08-29 | DOI: 10.1186/s13040-025-00474-5

Edoardo Taccaliti, Jesus S Aguilar-Ruiz

Biodata Mining 18(1), 60 | Publication type: Journal Article | Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395650/pdf/ | DOI link: https://doi.org/10.1186/s13040-025-00474-5

Citations: 0

Abstract



Class imbalance poses a serious challenge in biomedical machine learning, particularly in genomics, where datasets are characterized by extremely high dimensionality and very limited sample sizes. In such settings, standard classifiers tend to favor the majority class, leading to biased predictions, an especially problematic issue in clinical diagnostics where rare conditions must not be overlooked. In this study, we introduce a Kernel Density Estimation (KDE)-based oversampling approach to rebalance imbalanced genomic datasets by generating synthetic minority class samples. Unlike conventional methods such as SMOTE, KDE estimates the global probability distribution of the minority class and resamples accordingly, avoiding local interpolation pitfalls. We evaluate our method on 15 real-world genomic datasets using three classifiers (Naïve Bayes, Decision Trees, and Random Forests) and compare it with SMOTE and baseline training. Experimental results demonstrate that KDE oversampling consistently improves classification performance, especially in metrics robust to imbalance, such as the AUC of the IMCP curve. Notably, KDE achieves superior results in tree-based models while dramatically simplifying the sampling process. This approach offers a statistically grounded and effective solution for balancing genomic datasets, with strong potential for improving fairness and accuracy in high-stakes medical decision-making.
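The abstract describes KDE oversampling only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes a Gaussian kernel with a single fixed bandwidth, a binary problem, and plain Python lists; the function names `kde_oversample` and `rebalance`, the `bandwidth` default, and the toy data are all hypothetical. It relies on a standard property: drawing from a Gaussian KDE is the same as picking an observed sample at random and adding Gaussian noise scaled by the bandwidth. SMOTE, by contrast, interpolates between a sample and one of its nearest neighbours.

```python
import random


def kde_oversample(minority, n_new, bandwidth, seed=0):
    """Draw n_new synthetic rows from a Gaussian KDE fitted to `minority`.

    Sampling from a Gaussian KDE with an isotropic kernel is equivalent to
    picking an observed row uniformly at random and adding independent
    N(0, bandwidth^2) noise to every feature.
    """
    if not minority:
        raise ValueError("minority class must contain at least one sample")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        center = rng.choice(minority)  # kernel centre: one real sample
        synthetic.append([x + rng.gauss(0.0, bandwidth) for x in center])
    return synthetic


def rebalance(X, y, minority_label, bandwidth=0.5, seed=0):
    """Append synthetic minority rows until both classes have equal size.

    Assumes a binary problem: every row not labelled `minority_label`
    is counted as majority. The bandwidth default is arbitrary.
    """
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_new = max(0, n_majority - len(minority))
    extra = kde_oversample(minority, n_new, bandwidth, seed)
    # Build new lists so the original X and y are left untouched.
    return X + extra, y + [minority_label] * n_new


# Toy data: 6 majority rows (label 0) and 2 minority rows (label 1).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.2, 0.2],
     [2.0, 2.1], [2.2, 1.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = rebalance(X, y, minority_label=1)
print(len(X_bal), y_bal.count(0), y_bal.count(1))  # prints: 12 6 6
```

In practice the bandwidth would need to be chosen with care, particularly with many features and few samples, and oversampling should be applied to the training split only so that synthetic rows never leak into evaluation data. A library estimator such as scikit-learn's `KernelDensity`, which provides a `sample` method for Gaussian kernels, could replace the hand-written sampler.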


Source journal

Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published:

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:

- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.