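Advancing preeclampsia prediction: a tailored machine learning pipeline integrating resampling and ensemble models for handling imbalanced medical data.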

IF 6.1, Zone 3 (Biology), Q1, MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining 18(1):25. Pub Date: 2025-03-24. DOI: 10.1186/s13040-025-00440-1

Yinyao Ma, Hanlin Lv, Yanhua Ma, Xiao Wang, Longting Lv, Xuxia Liang, Lei Wang


Citations: 0

Abstract


Background: Constructing a predictive model is challenging for imbalanced medical datasets (such as those for preeclampsia), particularly when employing ensemble machine learning algorithms.

Objective: This study aims to develop a robust pipeline that enhances the predictive performance of ensemble machine learning models for the early prediction of preeclampsia in an imbalanced dataset.

Methods: Our research establishes a comprehensive pipeline optimized for early preeclampsia prediction in imbalanced medical datasets. We gathered electronic health records from pregnant women at the People's Hospital of Guangxi from 2015 to 2020, with additional external validation using three public datasets. This extensive data collection facilitated the systematic assessment of various resampling techniques, varied minority-to-majority ratios, and ensemble machine learning algorithms through a structured evaluation process. We analyzed 4,608 combinations of model settings against performance metrics such as G-mean, MCC, AP, and AUC to determine the most effective configurations. Advanced statistical analyses including OLS regression, ANOVA, and Kruskal-Wallis tests were utilized to fine-tune these settings, enhancing model performance and robustness for clinical application.
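Two of the metrics named above, G-mean and MCC, are computed directly from the binary confusion matrix, which is why they are commonly preferred over accuracy on imbalanced data. The following is a minimal sketch of their standard definitions, not the authors' code; the function names and the toy counts are illustrative assumptions, and label 1 is assumed to mark the minority (preeclampsia) class.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels; 1 is assumed to be the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: 10 positives among 100 cases, classifier predicts all negative.
    y_true = [1] * 10 + [0] * 90
    y_pred = [0] * 100
    counts = confusion_counts(y_true, y_pred)
    accuracy = (counts[0] + counts[2]) / len(y_true)
    print("accuracy:", accuracy)        # high despite missing every positive
    print("G-mean:", g_mean(*counts))   # collapses to zero
    print("MCC:", mcc(*counts))
```

In this toy case a classifier that never predicts the minority class still scores 90% accuracy, while G-mean and MCC both fall to zero. AP and AUC are threshold-free ranking metrics and are not covered by this sketch.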

Results: Our analysis confirmed the significant impact of systematic sequential optimization of variables on the predictive performance of our models. The most effective configuration used the Inverse Weighted Gaussian Mixture Model for resampling, combined with the Gradient Boosting Decision Trees algorithm and an optimized minority-to-majority ratio of 0.09, achieving a Geometric Mean of 0.6694 (95% confidence interval: 0.5855-0.7557). This configuration significantly outperformed the baseline across all evaluated metrics, demonstrating substantial improvements in model performance.
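To make the minority-to-majority ratio concrete, the sketch below works out how many synthetic minority samples an oversampler would have to generate to reach a target ratio such as 0.09. It covers only the arithmetic: the Inverse Weighted Gaussian Mixture Model itself is not reproduced, the function name is hypothetical, and the class counts are invented for illustration rather than taken from the study.

```python
import math


def synthetic_needed(n_minority, n_majority, target_ratio):
    """Synthetic minority samples required so that minority / majority >= target_ratio.

    Returns 0 when the data already meets or exceeds the target ratio.
    """
    if n_majority <= 0 or target_ratio <= 0:
        raise ValueError("n_majority and target_ratio must be positive")
    # Round before ceil so floating-point noise (e.g. 900.0000000001) does not add a sample.
    target_minority = math.ceil(round(target_ratio * n_majority, 9))
    return max(0, target_minority - n_minority)


if __name__ == "__main__":
    # Hypothetical counts: 300 cases vs 10,000 controls (ratio 0.03).
    extra = synthetic_needed(300, 10_000, 0.09)
    print("synthetic minority samples to generate:", extra)
    print("resulting ratio:", (300 + extra) / 10_000)
```

Note that a ratio of 0.09 leaves the classes far from balanced; the abstract reports this value as the optimized setting rather than full 1:1 balancing.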

Conclusions: This study establishes a robust pipeline that significantly enhances the predictive performance of models for preeclampsia within imbalanced datasets. Our findings underscore the importance of a strategic approach to variable optimization in medical diagnostics, offering potential for broad application in various medical contexts where class imbalance is a concern.


Source journal

Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.90
Self-citation rate: 0.00%
Articles published:
Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.