PSM-SMOTE: propensity score matching and synthetic minority oversampling for handling unbalanced microbiome data
Authors: Jeongsup Moon, Zhe Liu, Taesung Park
Journal: Genes & Genomics (impact factor 1.7; JCR Q4, Biochemistry & Molecular Biology)
Published: 2025-10-04
DOI: https://doi.org/10.1007/s13258-025-01688-x
Citations: 0
Abstract
Background: Predictive models using microbiome data often suffer from covariate imbalance and class imbalance, biasing results. Propensity Score Matching (PSM) balances covariates but reduces sample size, while borderline synthetic minority oversampling technique (borderline-SMOTE) oversamples minority classes but can generate uninformative examples.
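The sample-size cost of PSM noted in the Background can be illustrated with a minimal sketch of greedy 1:1 nearest-neighbour matching under a caliper. This is not the authors' implementation: the function name `caliper_match`, the sample identifiers, the propensity scores, and the two caliper values are all hypothetical, and the caliper is applied to the raw score difference as a simplifying assumption.

```python
from typing import Dict, List, Tuple


def caliper_match(
    case_scores: Dict[str, float],
    control_scores: Dict[str, float],
    caliper: float,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    Each case is paired with the closest unused control, but only if the
    score difference is within `caliper`; otherwise the case is dropped.
    """
    available = dict(control_scores)  # controls not yet matched
    pairs: List[Tuple[str, str]] = []
    # Process cases in a fixed order so the result is deterministic.
    for case_id in sorted(case_scores):
        if not available:
            break
        score = case_scores[case_id]
        # Closest remaining control; ties are broken by identifier.
        best = min(available, key=lambda c: (abs(available[c] - score), c))
        if abs(available[best] - score) <= caliper:
            pairs.append((case_id, best))
            del available[best]
    return pairs


# Hypothetical propensity scores (illustrative values only).
cases = {"case1": 0.30, "case2": 0.55, "case3": 0.90}
controls = {"ctrl1": 0.32, "ctrl2": 0.48, "ctrl3": 0.58, "ctrl4": 0.10}

loose = caliper_match(cases, controls, caliper=0.50)
strict = caliper_match(cases, controls, caliper=0.05)
print(len(loose), "pairs with the loose caliper")
print(len(strict), "pairs with the strict caliper")
```

With these made-up scores the strict caliper discards `case3`, which has no sufficiently close control. This is the trade-off described above: tighter matching gives better covariate balance but leaves fewer samples for model training.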
Objective: To develop and evaluate PSM-SMOTE, a novel hybrid sampling method that integrates PSM and borderline-SMOTE to handle both covariate and class imbalance in microbiome data.
Methods: We developed PSM-SMOTE, a three-step hybrid sampling algorithm for microbiome data: (1) PSM at four caliper levels to balance covariates, (2) selection of at least ten robust differential markers via seven statistical tests with false discovery rate correction, and (3) application of borderline-SMOTE on the marker-based distance matrix to oversample minority classes. We evaluated PSM-SMOTE on three publicly available microbiome case-control datasets: pancreatic ductal adenocarcinoma (PDAC), colorectal cancer (CRC), and obesity, using logistic regression (LR), random forest (RF), and support vector machine (SVM) classifiers. Performance was assessed via area under the ROC curve (AUC).
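Step (3) can be sketched as follows. This is a minimal, self-contained illustration of the general borderline-SMOTE idea (flag minority samples whose neighbourhood is dominated by the majority class, then interpolate between them and nearby minority samples), not the authors' code. The function `borderline_smote`, the parameters `m` and `k`, the toy two-marker data, and the use of Euclidean distance on marker values are assumptions made for illustration; the abstract does not state which distance the marker-based matrix uses.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def borderline_smote(
    minority: List[Vector],
    majority: List[Vector],
    n_new: int,
    m: int = 3,
    k: int = 2,
    seed: int = 0,
) -> List[List[float]]:
    """Borderline-SMOTE-style oversampling (illustrative sketch)."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]

    # Step A: find "danger" minority points, i.e. those whose m nearest
    # neighbours are at least half, but not all, majority samples.
    danger = []
    for p in minority:
        others = [(q, y) for q, y in labelled if q is not p]
        others.sort(key=lambda item: euclidean(p, item[0]))
        n_major = sum(1 for _, y in others[:m] if y == 0)
        if m / 2 <= n_major < m:
            danger.append(p)

    synthetic: List[List[float]] = []
    if not danger or len(minority) < 2:
        return synthetic

    # Step B: interpolate between a danger point and one of its k nearest
    # minority neighbours.
    for _ in range(n_new):
        p = rng.choice(danger)
        neighbours = sorted(
            (q for q in minority if q is not p),
            key=lambda q: euclidean(p, q),
        )[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from p to q
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


# Toy data: two hypothetical marker abundances per sample.
minority = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0)]
majority = [(1.2, 1.0), (1.0, 1.3), (2.0, 2.0), (2.2, 2.1), (3.0, 3.0)]

synthetic = borderline_smote(minority, majority, n_new=4)
for point in synthetic:
    print([round(v, 3) for v in point])
```

In this toy data only the minority sample at (1.0, 1.0) sits near the class boundary, so every synthetic point lies on a segment between it and another minority sample. For real analyses a maintained implementation, such as the `BorderlineSMOTE` class in imbalanced-learn, would normally be preferable to a hand-written sketch.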
Results: PSM-SMOTE improved test AUCs in multiple model-dataset combinations compared with using PSM alone. Notably, for the RF model, PSM-SMOTE consistently enhanced AUC across nearly all oversampling settings in the PDAC and obesity cohorts. For the SVM model, PSM-SMOTE also achieved a significant AUC increase in the CRC cohort. For the LR model, PSM-SMOTE showed modest improvement under strict matching.
Conclusion: PSM-SMOTE effectively addresses dual imbalance in microbiome data and consistently enhances performance, providing a practical solution for imbalanced data analyses.
About the journal:
Genes & Genomics is an official journal of the Korean Genetics Society (http://kgenetics.or.kr/). Although it is an official publication of the Society, membership is not required for contributors. It is a peer-reviewed international journal published in print (ISSN 1976-9571) and online (E-ISSN 2092-9293). It covers all disciplines of genetics and genomics, from prokaryotes to eukaryotes and from fundamental heredity to molecular aspects. Articles may be reviews, research articles, or short communications.