Developing an explainable machine learning model to predict false-negative citrin deficiency cases in newborn screening
Authors: Peiyao Wang, Haomin Li, Xinjie Yang, Lingwei Hu, Yuhe Chen, Ziyan Cen, Pingping Ge, Qimin He, Benqing Wu, Xinwen Huang
Journal: Orphanet Journal of Rare Diseases, 20(1), 507 (published 2025-10-08)
DOI: 10.1186/s13023-025-04045-z (https://doi.org/10.1186/s13023-025-04045-z)
Impact factor: 3.5; JCR: Q2 (Genetics & Heredity)
Citations: 0
Abstract
Background: Neonatal Intrahepatic Cholestasis caused by Citrin Deficiency (NICCD) is an autosomal recessive disorder affecting the urea cycle and energy metabolism. Newborn screening (NBS) usually relies on elevated citrulline, but some patients have normal citrulline, resulting in false negatives and delayed diagnosis. This study develops an explainable machine learning (ML) model to predict false-negative NICCD cases during NBS.
Methods: Data from 53 false-negative NICCD patients and 212 controls, collected retrospectively between 2011 and 2024, were analyzed. The dataset was split into a training set (70%) and a test set (30%). External validation involved 48 participants from distinct time periods. Key predictors were identified using variable importance in projection (VIP > 1) and Lasso regression. Six ML models were trained and compared: Logistic Regression, Random Forest, Light Gradient Boosting Machine, Extreme Gradient Boosting (XGBoost), K-Nearest Neighbor, and Support Vector Machine. Performance was evaluated using the area under the receiver operating characteristic curve (AUC) and the F1 score. SHapley Additive exPlanations (SHAP) was applied to determine feature importance and interpret the models.
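The evaluation setup in the Methods can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: it splits the reported class counts (53 cases, 212 controls) 70/30 and computes AUC and F1 from first principles. The abstract does not say whether the split was stratified, so stratification, the random seed, and the helper names `stratified_split`, `roc_auc`, and `f1_score` are all assumptions made here.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/test while keeping the class ratio in both parts.
    Stratification is an assumption; the abstract only states a 70/30 split."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive scores higher than a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Class counts taken from the abstract: 53 false-negative cases, 212 controls.
labels = [1] * 53 + [0] * 212
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))       # 185 80
print(sum(labels[i] for i in test_idx))    # 16 cases in the test set

# Tiny hand-checkable metric examples (scores and predictions are made up).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))          # 0.5
```

With only 53 positive cases, a stratified split keeps roughly 16 of them in the test set; an unstratified split could leave noticeably fewer, which would make the test AUC and F1 less stable.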
Results: Birth weight, citrulline, glycine, phenylalanine, ornithine, arginine, proline, succinylacetone, and C10:2 were selected as predictive features. Among the ML models, XGBoost demonstrated the most robust and consistent performance, achieving AUCs of 0.971 (95% CI: 0.959-0.979), 0.968, and 0.977, and F1 scores of 0.786 (95% CI: 0.744-0.820), 0.828, and 0.833 in the training, test, and external validation sets, respectively. SHAP analysis showed that the most important features were citrulline, glycine, phenylalanine, succinylacetone, birth weight, and ornithine. Feature pairs such as citrulline-phenylalanine, citrulline-glycine, succinylacetone-birth weight, and ornithine-glycine showed varying interactions. SHAP force plots, decision plots, and waterfall plots provided patient-level interpretations. Finally, we built a web-based calculator for the prediction of false-negative NICCD cases (https://myapp123.shinyapps.io/my_shiny_app/).
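To make the patient-level SHAP interpretation more concrete, the sketch below computes exact Shapley values by brute force for a single sample. It is an illustration under stated assumptions, not the authors' model or the `shap` library: the scoring function `toy_score`, its coefficients, the feature values, and the all-zero baseline are hypothetical, and "absent" features are simply replaced by baseline values.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.
    Features outside the coalition take their baseline value (an assumption;
    SHAP implementations typically average over a background dataset)."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical model on three standardized features, with an interaction term
# between the first and third feature. Not fitted to any data.
def toy_score(z):
    citrulline, glycine, phenylalanine = z
    return 2.0 * citrulline - 1.0 * glycine + citrulline * phenylalanine

x = [1.0, 2.0, 3.0]          # hypothetical patient
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = shapley_values(toy_score, x, baseline)
print([round(p, 6) for p in phi])                       # [3.5, -2.0, 1.5]
print(round(sum(phi), 6), toy_score(x) - toy_score(baseline))  # 3.0 3.0
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that waterfall and force plots display. The interaction term is split evenly between the two features involved, which is one way a pair such as citrulline-phenylalanine can show an interaction effect.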
Conclusion: An interpretable machine learning model utilizing metabolite and demographic data enhances the detection of false-negative NICCD cases, facilitates early identification and intervention, and ultimately improves the overall effectiveness of the newborn screening system.
About the journal:
Orphanet Journal of Rare Diseases is an open access, peer-reviewed journal that encompasses all aspects of rare diseases and orphan drugs. The journal publishes high-quality reviews on specific rare diseases. In addition, the journal may consider articles on clinical trial outcome reports, either positive or negative, and articles on public health issues in the field of rare diseases and orphan drugs. The journal does not accept case reports.