Evaluation of machine learning and logistic regression-based gestational diabetes prognostic models
Yitayeh Belsti, Lisa Moran, Aya Mousa, Helena Teede, Joanne Enticott
Journal of Clinical Epidemiology, volume 187, Article 111957. Published 2025-08-29.
DOI: 10.1016/j.jclinepi.2025.111957
URL: https://www.sciencedirect.com/science/article/pii/S0895435625002902
Citations: 0
Abstract
Objectives
This study aimed to follow best practice by temporally evaluating existing gestational diabetes mellitus (GDM) prediction models, updating them where needed, and comparing the temporal evaluation performance of the machine learning (ML)-based models with that of regression-based models.
Study Design and Setting
The temporal validation dataset comprised new data from 12,722 singleton pregnancies at the Monash Health Network from 2021 to 2022. Three models were evaluated: the Monash GDM Logistic Regression (LR) model with six categorical variables (version 2), the Monash GDM ML model (version 3), and an extended LR GDM model (version 3), with the version 3 models each using eight categorical and continuous variables. Model performance was assessed using discrimination and calibration. Decision curve analyses (DCA) were performed to determine the net benefit of the models. Recalibration was considered to improve model performance.
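As a rough illustration of the two performance measures named above, the following Python sketch computes an AUC (discrimination) and a calibration intercept and slope on simulated data. Nothing here comes from the study: the data are synthetic, the function names `auc` and `fit_calibration` are hypothetical, and regressing the outcome on the logit of the predicted risk is one common way to obtain a calibration slope, assumed here rather than confirmed as the authors' method.

```python
import math
import random


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def auc(y_true, y_prob):
    """AUC via the Mann-Whitney rank-sum statistic (ties get the average rank)."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    ranks = [0.0] * len(y_prob)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_prob[order[j + 1]] == y_prob[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_calibration(y_true, y_prob, n_iter=15):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson; returns (a, b)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + b * xi)
            w = mu * (1 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Synthetic example (assumption): predicted risks whose true calibration
# intercept is 0.2 and true calibration slope is 0.8.
random.seed(0)
pred = [random.uniform(0.05, 0.95) for _ in range(20000)]
outcome = [1 if random.random() < sigmoid(0.2 + 0.8 * logit(p)) else 0 for p in pred]

demo_auc = auc(outcome, pred)
demo_intercept, demo_slope = fit_calibration(outcome, pred)
print(f"AUC={demo_auc:.3f} intercept={demo_intercept:.3f} slope={demo_slope:.3f}")
```

Under this framework a slope near 1 and an intercept near 0 indicate good calibration, while a slope below 1 suggests predictions that are too extreme.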
Results
The development datasets for model versions 2 and 3 and the new temporal validation dataset included 21.2%, 22.5%, and 33.5% of pregnant women aged ≥35 years, respectively; 22%, 23.7%, and 24.0% with a body mass index ≥30 kg/m²; and GDM prevalence rates of 18%, 21.3%, and 28.6%, respectively. Discrimination was similar across the models, with an area under the receiver operating characteristic curve (AUC) of 0.72 [95% CI: 0.71, 0.73] for the version 2 LR model, 0.73 [95% CI: 0.72, 0.74] for the version 3 ML model, and 0.73 [95% CI: 0.73, 0.74] for the version 3 LR model. All models exhibited overestimation, with calibration slopes of 0.87, 0.99, and 0.87, respectively, which improved with recalibration. DCA showed that all models had a higher net benefit than the treat-all and treat-none strategies. For all models, some variability in prediction performance was observed across ethnic groups and parity.
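The net benefit used in decision curve analysis is commonly defined, at a risk threshold p_t, as TP/n - (FP/n) × p_t / (1 - p_t), with treat-none fixed at zero. The Python sketch below applies that standard definition to a toy dataset. The function names, the toy outcomes and predictions, and the thresholds of 0.5 and 0.2 are all assumptions made for illustration; the abstract does not report the thresholds the authors examined.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of treating everyone, given the outcome prevalence."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (assumption): five pregnancies, two with the outcome.
y_demo = [1, 1, 0, 0, 0]
p_demo = [0.9, 0.6, 0.7, 0.2, 0.1]

nb_model = net_benefit(y_demo, p_demo, 0.5)                       # 2/5 - (1/5) * 1 = 0.2
nb_all = net_benefit_treat_all(sum(y_demo) / len(y_demo), 0.5)    # 0.4 - 0.6 * 1 = -0.2
nb_none = 0.0                                                     # treat-none by definition

print(nb_model, nb_all, nb_none)
```

With the 28.6% GDM prevalence reported for the validation dataset and an assumed threshold of 0.2, `net_benefit_treat_all(0.286, 0.2)` is about 0.1075; a model adds value at that threshold only if its net benefit exceeds this figure.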
Conclusion
Despite significant changes in the background characteristics of the population, we have demonstrated that all models remained robust, especially after recalibration. However, the performance of the original ML model decreased significantly during validation. Dynamic models are better suited to adapt to the temporal changes in baseline characteristics of pregnant women and the resulting calibration drift, as they can incorporate new data without requiring manual evaluation.
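To make the idea of calibration drift concrete, the sketch below shows one simple form of recalibration: shifting the model intercept on the logit scale so that the mean predicted risk matches the event rate observed in new data (often called recalibration-in-the-large). This is an assumed, illustrative method; the abstract does not state which recalibration approach the authors used, and the function names `intercept_shift` and `recalibrate` are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def intercept_shift(y_true, y_prob, lo=-10.0, hi=10.0, tol=1e-10):
    """Find delta so that mean(sigmoid(logit(p) + delta)) equals the observed event rate.

    Solved by bisection; the mean prediction is monotonically increasing in delta.
    """
    target = sum(y_true) / len(y_true)
    lp = [logit(p) for p in y_prob]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_pred = sum(sigmoid(v + mid) for v in lp) / len(lp)
        if mean_pred < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def recalibrate(y_prob, delta):
    """Apply the intercept shift to every predicted risk."""
    return [sigmoid(logit(p) + delta) for p in y_prob]


# Toy data (assumption): a model that predicts 18% risk for everyone,
# evaluated in a later cohort where 30% of pregnancies have the outcome.
old_pred = [0.18] * 10
new_outcome = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

delta = intercept_shift(new_outcome, old_pred)
new_pred = recalibrate(old_pred, delta)
print(round(delta, 4), round(sum(new_pred) / len(new_pred), 4))
```

A dynamic model would, in effect, repeat an update of this kind as new data arrive; a shift in the intercept corrects the average risk level but leaves the ranking of individuals, and hence the AUC, unchanged.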
About the journal:
The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.