PSM-SMOTE: propensity score matching and synthetic minority oversampling for handling unbalanced microbiome data.

IF 1.7; Zone 4 (Biology); JCR Q4, Biochemistry & Molecular Biology
Jeongsup Moon, Zhe Liu, Taesung Park
Genes & genomics. Journal Article. Published 2025-10-04. DOI: 10.1007/s13258-025-01688-x (https://doi.org/10.1007/s13258-025-01688-x)

Abstract

Background: Predictive models using microbiome data often suffer from covariate imbalance and class imbalance, biasing results. Propensity Score Matching (PSM) balances covariates but reduces sample size, while borderline synthetic minority oversampling technique (borderline-SMOTE) oversamples minority classes but can generate uninformative examples.
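
To make the borderline idea concrete, here is a minimal sketch of borderline-SMOTE in plain Python. It flags a minority sample as "danger" when at least half, but not all, of its m nearest neighbours belong to the majority class, then interpolates only from those samples. The function names, the parameters `m` and `k`, and the Euclidean distance are assumptions for illustration; the abstract does not specify the implementation the authors used.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def danger_points(minority, majority, m=5):
    """Indices of minority samples that sit on the class border.

    A sample is 'danger' when at least half, but not all, of its m nearest
    neighbours (searched over both classes) are majority samples.
    """
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    danger = []
    for i, p in enumerate(minority):
        # Skip the sample itself, then rank every other sample by distance.
        others = [(euclidean(p, q), lab)
                  for j, (q, lab) in enumerate(labelled) if j != i]
        others.sort(key=lambda t: t[0])
        n_major = sum(lab for _, lab in others[:m])
        if m / 2 <= n_major < m:
            danger.append(i)
    return danger


def borderline_smote(minority, majority, n_new, m=5, k=3, seed=0):
    """Create n_new synthetic minority samples from the danger set only."""
    rng = random.Random(seed)
    danger = danger_points(minority, majority, m)
    synthetic = []
    if not danger or len(minority) < 2:
        return synthetic  # nothing on the border, so nothing to interpolate
    for _ in range(n_new):
        i = rng.choice(danger)
        p = minority[i]
        # k nearest minority neighbours of p, excluding p itself.
        nbrs = sorted((j for j in range(len(minority)) if j != i),
                      key=lambda j: euclidean(p, minority[j]))[:k]
        q = minority[rng.choice(nbrs)]
        gap = rng.random()  # position along the segment from p to q
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic
```

Restricting interpolation to the danger set is what separates this from plain SMOTE: samples deep inside the minority region contribute no synthetic points.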

Objective: To develop and evaluate PSM-SMOTE, a novel hybrid sampling method that integrates PSM and borderline-SMOTE to handle both covariate and class imbalance in microbiome data.

Methods: We developed PSM-SMOTE, a three-step hybrid sampling algorithm for microbiome data: (1) PSM at four caliper levels to balance covariates, (2) selection of at least ten robust differential markers via seven statistical tests with false discovery rate correction, and (3) application of borderline-SMOTE on the marker-based distance matrix to oversample minority classes. We evaluated PSM-SMOTE on three publicly available microbiome case-control datasets: pancreatic ductal adenocarcinoma (PDAC), colorectal cancer (CRC), and obesity, using logistic regression (LR), random forest (RF), and support vector machine (SVM) classifiers. Performance was assessed via area under the ROC curve (AUC).
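
Steps (1) and (2) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the propensity scores are taken as already estimated (for example from a logistic regression on the covariates), matching is greedy 1:1 nearest-neighbour without replacement, and the false discovery rate correction is taken to be Benjamini-Hochberg, which the abstract does not name. The function names and example caliper values are hypothetical.

```python
def caliper_match(ps_case, ps_control, caliper):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    Each case is paired with the closest unused control, and the pair is
    kept only if the score difference is within `caliper`.
    Returns a list of (case_index, control_index) pairs.
    """
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_case):
        if not available:
            break
        # sorted() makes tie-breaking deterministic (lowest index wins).
        j = min(sorted(available), key=lambda c: abs(ps_control[c] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical scores and caliper values, for illustration only.
    cases = [0.30, 0.62, 0.90]
    controls = [0.28, 0.35, 0.60, 0.10]
    for caliper in (0.01, 0.05, 0.10, 0.20):
        print(caliper, caliper_match(cases, controls, caliper))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A tighter caliper keeps fewer pairs, which is the sample-size cost of matching that the oversampling step is meant to offset.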

Results: PSM-SMOTE improved test AUCs in multiple model-dataset combinations compared with using PSM alone. Notably, for the RF model, PSM-SMOTE consistently enhanced AUC across nearly all oversampling settings in the PDAC and obesity cohorts. For the SVM model, PSM-SMOTE also achieved a significant AUC increase in the CRC cohort. For the LR model, PSM-SMOTE showed modest improvement under strict matching.
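
For readers less familiar with the metric, the AUC used above equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the 0/1 label encoding are assumptions for illustration, and the loop is quadratic, so it suits small examples only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparison.

    labels: 1 for cases, 0 for controls. scores: higher means more case-like.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above control
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.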

Conclusion: PSM-SMOTE effectively addresses dual imbalance in microbiome data and consistently enhances performance, providing a practical solution for imbalanced data analyses.

Source journal
Genes & genomics (Biology: Biochemistry & Molecular Biology)
CiteScore: 3.70
Self-citation rate: 4.80%
Articles published: 131
Review time: 6-12 weeks
About the journal: Genes & Genomics is an official journal of the Korean Genetics Society (http://kgenetics.or.kr/). Although it is an official publication of the Genetics Society of Korea, membership of the Society is not required for contributors. It is a peer-reviewed international journal published in print (ISSN 1976-9571) and online (E-ISSN 2092-9293). It covers all disciplines of genetics and genomics, from prokaryotes to eukaryotes and from fundamental heredity to molecular aspects. Articles can be reviews, research articles, or short communications.