Development and validation of an interpretable machine learning model for predicting Philadelphia chromosome-positive acute lymphoblastic leukaemia using clinical and laboratory parameters: a single-centre retrospective study.

IF 2.4, Zone 3 (Medicine), Q1 MEDICINE, GENERAL & INTERNAL

BMJ Open Pub Date : 2025-06-27 DOI:10.1136/bmjopen-2024-097526

Wuchen Yang, Jingya Liu, Yang Gou, Xingqin Huang, Maoshan Chen, Dezhi Huang, Shengwang Wu, Jing Zhang, Cheng Zhang, Shuiqing Liu, Xiangui Peng, Xi Zhang

BMJ Open, volume 15, issue 6, article e097526. Publication type: Journal Article.

Citation count: 0

Abstract

Objective: To develop and validate a prediction model for Philadelphia chromosome-positive acute lymphoblastic leukaemia (Ph+ALL).

Design: A single-centre retrospective study.

Participants: This study analysed 471 patients newly diagnosed with ALL at the Second Affiliated Hospital of Army Medical University between January 2014 and December 2023.

Methods: Clinical and laboratory parameters were collected, and the most important features were selected using BorutaShap. Multiple machine learning (ML) models were constructed and optimised using an active learning (AL) algorithm. Performance was evaluated using the area under the curve (AUC), comprehensive indicators and decision curve analysis. Model interpretability was assessed using SHapley Additive exPlanations (SHAP), and external validation was conducted on an independent test cohort.
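The abstract does not state which query strategy the AL algorithm used. As a minimal sketch of one common option, pool-based uncertainty sampling, the snippet below picks the unlabelled samples whose predicted probability lies closest to 0.5. The function name `select_most_uncertain` and the probabilities are hypothetical and are not taken from the study.

```python
def select_most_uncertain(probabilities, k):
    """Return indices of the k samples whose predicted probability
    is closest to 0.5, i.e. where the model is least confident."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical predicted probabilities of Ph+ALL for five unlabelled patients.
pool_probabilities = [0.95, 0.52, 0.10, 0.45, 0.70]

# Indices of the two samples that would be queried for labelling next.
query_indices = select_most_uncertain(pool_probabilities, k=2)
print(query_indices)
```

In a full loop, the queried samples would be labelled, added to the training set, and the model refitted before the next round of selection.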

Results: Ten parameters were selected to construct the ML models. The CatBoost model integrated with an AL algorithm (CatBoost-AL) was the most effective model for predicting Ph+ALL in the validation data set, achieving an AUC of 0.797 (95% CI 0.710 to 0.884), with sensitivity, specificity and F1 score of 0.667, 0.864 and 0.777, respectively. On an external testing set, CatBoost-AL maintained an AUC of 0.794 (95% CI 0.707 to 0.881). In the SHAP global interpretability analysis, age, monocyte count, γ-glutamyl transferase, neutrophil count and alanine aminotransferase were identified as the parameters with the greatest influence on the predictions of CatBoost-AL.
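For readers unfamiliar with how an AUC and its 95% CI can be obtained, the following is a minimal sketch using the Mann-Whitney formulation of the AUC and a percentile bootstrap. The abstract does not say how the reported intervals were computed, so the bootstrap is an assumption, and the labels and scores are toy values rather than study data.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # resample contains one class only; AUC is undefined
        estimates.append(auc(sample_labels, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy example: 1 = Ph+ALL, 0 = Ph-negative ALL (hypothetical values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
print(bootstrap_auc_ci(toy_labels, toy_scores, n_resamples=200))
```

With only four toy samples the interval is very wide. On a real validation cohort the same procedure would be applied to the model's predicted probabilities.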

Conclusion: An interpretable ML model and an online prediction tool were developed to determine whether patients newly diagnosed with ALL have Ph+ALL. The key parameters identified by the optimal model provide further insight into the characteristics of Ph+ALL and are valuable for its accurate diagnosis and treatment.


Source journal

BMJ Open (MEDICINE, GENERAL & INTERNAL)

CiteScore: 4.40
Self-citation rate: 3.40%
Articles published: 4510
Review time: 2-3 weeks

About the journal: BMJ Open is an online, open access journal dedicated to publishing medical research from all disciplines and therapeutic areas. The journal publishes all research study types, from study protocols to phase I trials to meta-analyses, including small or specialist studies. Publishing procedures are built around fully open peer review and continuous publication, with research published online as soon as the article is ready.