Construction and validation of an interpretable XGBoost machine learning model to predict ESBL positivity rates based on urinalysis data
Lulu Zhang, Shaokui Hua, Yu Zhang, Yan Jiang, Qunlian Huang, Baoyuan Chang, Dengke Li
European Journal of Clinical Microbiology & Infectious Diseases (IF 3.7, JCR Q2, Infectious Diseases), published 2025-05-02
DOI: 10.1007/s10096-025-05155-z (https://doi.org/10.1007/s10096-025-05155-z)
Abstract
Background: Microbiological culture and drug susceptibility testing of urine samples have lengthy turnaround times, increasing the risk of extended-spectrum β-lactamase (ESBL)-positive urinary tract infection (UTI) patients progressing to sepsis.
Objective: To develop an efficient machine learning model for the identification of ESBL positivity in UTI patients.
Methods: This retrospective study included 528 samples that had undergone drug susceptibility testing and met the inclusion and exclusion criteria. Variables were screened using Lasso regression, and 70% of the samples were used to construct nine machine learning models (XGBClassifier, LogisticRegression, LGBMClassifier, AdaBoostClassifier, SVC, MLPClassifier, ComplementNB, GaussianNB, and GradientBoostingClassifier). Models were compared on accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, Kappa score, and area under the curve (AUC). The best model type was identified through ten-fold cross-validation; the final model of that type was then built and evaluated using the remaining 30% of the data as a test set. Predictions were interpreted with SHAP (SHapley Additive exPlanations), which quantifies the impact of each feature on a prediction and makes the model more transparent and interpretable.
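The data partitioning described in the Methods can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names, the random seed, the rounding of the 70% share, and the unstratified shuffling are all assumptions, since the abstract does not report those details.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test parts.

    The seed and the rounding rule are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    # Deal the indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 528 samples, as reported in the abstract; the resulting counts depend on
# the rounding assumption above and are not taken from the paper.
train_idx, test_idx = train_test_split_indices(528)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

Each candidate model type would be fitted on the training part of every fold and scored on its validation part; only the selected model type is then evaluated once on the held-out test indices.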
Results: The variables selected by the Lasso regression model were gender, urinary protein, urobilinogen, leukocytes, occult blood, age, pH, specific gravity, leukocyte count, erythrocyte count, epithelial cell count, and cast count. The XGBoost model outperformed the others in ten-fold cross-validation, with the following scores on the validation set (95% CI): AUC: 0.924 (0.860-0.989); cutoff: 0.664 (0.637-0.690); accuracy: 0.862 (0.839-0.885); sensitivity: 0.900 (0.879-0.920); specificity: 0.725 (0.618-0.832); PPV: 0.923 (0.896-0.950); NPV: 0.667 (0.626-0.707); F1 score: 0.911 (0.896-0.925); Kappa: 0.603 (0.527-0.679). The final model achieved an AUC of 0.968 and an accuracy of 0.943 on the test set.
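For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, specificity, PPV, NPV, F1 score, and Cohen's Kappa follow from a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's Kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts (not from the paper): 50 ESBL-positive, 50 ESBL-negative.
metrics = binary_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

PPV and NPV depend on the proportion of positive cases in the evaluated sample, so they shift with prevalence, whereas sensitivity and specificity do not.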
Conclusion: This study developed a rapid and efficient machine learning model capable of identifying ESBL positivity based solely on routine urine test data.
About the journal:
EJCMID is an interdisciplinary journal devoted to the publication of communications on infectious diseases of bacterial, viral and parasitic origin.