External validation of a machine learning prediction model for massive blood loss during surgery for spinal metastases: a multi-institutional study using 880 patients.
Daniël C de Reus, R Harmen Kuijten, Priyanshu Saha, Diego A Abelleyra Lastoria, Aliénor Warr-Esser, Charles F C Taylor, Olivier Q Groot, Darren Lui, Jorrit-Jan Verlaan, Daniel G Tobert
{"title":"External validation of a machine learning prediction model for massive blood loss during surgery for spinal metastases: a multi-institutional study using 880 patients.","authors":"Daniël C de Reus, R Harmen Kuijten, Priyanshu Saha, Diego A Abelleyra Lastoria, Aliénor Warr-Esser, Charles F C Taylor, Olivier Q Groot, Darren Lui, Jorrit-Jan Verlaan, Daniel G Tobert","doi":"10.1016/j.spinee.2025.03.018","DOIUrl":null,"url":null,"abstract":"<p><strong>Background context: </strong>A machine learning (ML) model was recently developed to predict massive intraoperative blood loss (>2500mL) during posterior decompressive surgery for spinal metastasis that performed well on external validation within the same region in China.</p><p><strong>Purpose: </strong>We sought to externally validate this model across new geographic regions (North America and Europe) and patient cohorts.</p><p><strong>Study design: </strong>Multi-institutional retrospective cohort study PATIENT SAMPLE: We retrospectively included patients 18 years or older who underwent decompressive surgery for spinal metastasis across three institutions in the United States, the United Kingdom and the Netherlands between 2016 and 2022. Inclusion and exclusion criteria were consistent with the development study with additional inclusion of (1) patients undergoing palliative decompression without stabilization, (2) patients with multiple myeloma and lymphoma, and (3) patients who continued anticoagulants perioperatively.</p><p><strong>Outcome measures: </strong>Model performance was assessed by comparing the incidence of massive intraoperative blood loss (>2,500mL) in our cohort to the predicted risk generated by the ML model. Blood loss was quantified in 7 ways (including the formula from the development study) as no gold standard exists, and the method in the development paper was not clearly defined. 
We estimated blood loss using the anesthesia report, and calculated it using transfusion data, and preoperative and postoperative hematocrit levels.</p><p><strong>Methods: </strong>The following five input variables necessary for risk calculation by the ML model were manually collected: tumor type, smoking status, ECOG score, surgical process, and preoperative platelet count. Model performance was assessed on overall fit (Brier score), discriminatory ability (area under the curve (AUC)), calibration (intercept & slope), and clinical utility (decision curve analysis (DCA)) for the total validation cohort, and for the North American and European cohorts separately. A sub-analysis, excluding the additional included patient groups, assessed the predictive model's performance with the same inclusion and exclusion criteria as the development cohort.</p><p><strong>Results: </strong>A total of 880 patients were included with a massive blood loss range from 5.3% to 18% depending on which quantification method was used. Using the most favorable quantification method, the predictive model overestimated risk in our total validation cohort and scored poorly on overall fit (Brier score: 0.278), discrimination (AUC: 0.631 [95%CI: 0.583, 0.680]), calibration, (intercept: -2.082, [95%CI: -2.285, -1.879]), slope: 0.283 [95%CI: 0.173, 0.393]), and clinical utility, with net harm observed in decision curve analysis from 20%. Similar poor performance results were observed in the sub-analysis excluding the additional included patients (n=676) and when analyzing the North American (n=539) and European (n=341) cohorts separately.</p><p><strong>Conclusions: </strong>To our knowledge, this is the first published external validation of a predictive ML model within orthopedic surgery to demonstrate poor performance. 
This poor performance might be attributed to overfitting and sampling bias as the development cohort had an insufficient sample size, and distributional shift as our cohort had key differences in predictive variables used by the model. These findings emphasize the importance of extensive validation in different geographical areas and addressing biases and known pitfalls of ML model development before clinical implementation, as untested models may do more harm than good.</p>","PeriodicalId":49484,"journal":{"name":"Spine Journal","volume":" ","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Spine Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.spinee.2025.03.018","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background context: A machine learning (ML) model was recently developed to predict massive intraoperative blood loss (>2,500 mL) during posterior decompressive surgery for spinal metastasis; it performed well on external validation within the same region in China.
Purpose: We sought to externally validate this model across new geographic regions (North America and Europe) and patient cohorts.
Study design: Multi-institutional retrospective cohort study.
Patient sample: We retrospectively included patients 18 years or older who underwent decompressive surgery for spinal metastasis across three institutions in the United States, the United Kingdom, and the Netherlands between 2016 and 2022. Inclusion and exclusion criteria were consistent with the development study, with additional inclusion of (1) patients undergoing palliative decompression without stabilization, (2) patients with multiple myeloma and lymphoma, and (3) patients who continued anticoagulants perioperatively.
Outcome measures: Model performance was assessed by comparing the incidence of massive intraoperative blood loss (>2,500 mL) in our cohort to the predicted risk generated by the ML model. Because no gold standard exists and the method in the development paper was not clearly defined, blood loss was quantified in seven ways (including the formula from the development study): it was estimated from the anesthesia report and calculated from transfusion data and from preoperative and postoperative hematocrit levels.
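One widely used hematocrit-based approach to calculating blood loss is Gross's formula; the abstract does not state which formulas the authors used among their seven quantification methods, so the sketch below is only an illustrative example of this class of calculation, with a conventional 70 mL/kg estimated blood volume as an assumed default.

```python
def estimated_blood_volume_ml(weight_kg: float, ml_per_kg: float = 70.0) -> float:
    """Rough estimated blood volume (EBV); 70 mL/kg is a common adult default."""
    return weight_kg * ml_per_kg


def gross_blood_loss_ml(weight_kg: float, hct_pre: float, hct_post: float) -> float:
    """Gross's formula: EBL = EBV * (Hct_pre - Hct_post) / Hct_mean.

    Hematocrits are given as fractions (e.g. 0.40 for 40%). This is one
    illustrative quantification method, not necessarily the study's.
    """
    ebv = estimated_blood_volume_ml(weight_kg)
    hct_mean = (hct_pre + hct_post) / 2.0
    return ebv * (hct_pre - hct_post) / hct_mean


# Example: a 70 kg patient whose hematocrit falls from 0.40 to 0.30.
loss = gross_blood_loss_ml(70.0, 0.40, 0.30)
is_massive = loss > 2500  # the study's threshold for massive blood loss
```

Different formulas can classify the same patient differently near the 2,500 mL threshold, which is why the incidence of massive blood loss varied with the quantification method in this study.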
Methods: The following five input variables necessary for risk calculation by the ML model were manually collected: tumor type, smoking status, ECOG score, surgical process, and preoperative platelet count. Model performance was assessed on overall fit (Brier score), discriminatory ability (area under the curve (AUC)), calibration (intercept & slope), and clinical utility (decision curve analysis (DCA)) for the total validation cohort, and for the North American and European cohorts separately. A sub-analysis, excluding the additional included patient groups, assessed the predictive model's performance with the same inclusion and exclusion criteria as the development cohort.
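The four performance dimensions named in the methods can be reproduced on synthetic data; the sketch below uses scikit-learn for Brier score and AUC, and the standard logistic-recalibration approach (regressing outcomes on the logit of predicted risk) for calibration intercept and slope. The data and regularization settings are assumptions for illustration, not the study's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
p_pred = rng.uniform(0.05, 0.95, size=500)         # model's predicted risks
y = (rng.uniform(size=500) < p_pred).astype(int)   # outcomes drawn to match

brier = brier_score_loss(y, p_pred)  # overall fit: lower is better
auc = roc_auc_score(y, p_pred)       # discrimination: 0.5 = chance

# Calibration: fit outcomes on the logit of predicted risk.
# A well-calibrated model has intercept ~ 0 and slope ~ 1; the study
# reported intercept -2.082 and slope 0.283, i.e. severe miscalibration.
logit = np.log(p_pred / (1 - p_pred)).reshape(-1, 1)
fit = LogisticRegression(C=1e12).fit(logit, y)  # large C: near-unpenalized
intercept, slope = fit.intercept_[0], fit.coef_[0][0]
```

Because the synthetic outcomes are drawn from the predicted risks themselves, the fitted intercept and slope land near their ideal values of 0 and 1, providing a reference point for reading the study's figures.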
Results: A total of 880 patients were included, with the incidence of massive blood loss ranging from 5.3% to 18% depending on the quantification method used. Even with the most favorable quantification method, the predictive model overestimated risk in our total validation cohort and performed poorly on overall fit (Brier score: 0.278), discrimination (AUC: 0.631 [95% CI: 0.583, 0.680]), calibration (intercept: -2.082 [95% CI: -2.285, -1.879]; slope: 0.283 [95% CI: 0.173, 0.393]), and clinical utility, with net harm observed in decision curve analysis at threshold probabilities above 20%. Similarly poor performance was observed in the sub-analysis excluding the additionally included patients (n=676) and when analyzing the North American (n=539) and European (n=341) cohorts separately.
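The "net harm" finding comes from the standard net-benefit quantity behind decision curve analysis: at a threshold probability pt, net benefit = TP/n - (FP/n) * pt/(1 - pt). The toy data below are invented to show how an overestimating model can dip below zero (net harm) at a 20% threshold; they are not the study's data.

```python
def net_benefit(y_true, p_pred, pt):
    """Net benefit at threshold pt: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= pt and y == 0)
    return tp / n - (fp / n) * pt / (1 - pt)


# Invented example: one true bleeder among ten patients, but the model
# assigns inflated risks, flagging seven patients at the 20% threshold.
y      = [1,   0,   0,   0,   0,   0,   0,   0,   0,   0]
p_over = [0.6, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.1, 0.1, 0.1]
nb = net_benefit(y, p_over, 0.20)  # negative: acting on the model harms
```

A negative net benefit means acting on the model's predictions at that threshold is worse than treating no one, which is the practical meaning of the net harm the authors observed above 20%.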
Conclusions: To our knowledge, this is the first published external validation of a predictive ML model within orthopedic surgery to demonstrate poor performance. This poor performance might be attributed to overfitting and sampling bias, as the development cohort had an insufficient sample size, and to distributional shift, as our cohort differed in key predictive variables used by the model. These findings emphasize the importance of extensive validation across different geographical areas, and of addressing biases and known pitfalls of ML model development before clinical implementation, as untested models may do more harm than good.
Journal introduction:
The Spine Journal, the official journal of the North American Spine Society, is an international and multidisciplinary journal that publishes original, peer-reviewed articles on research and treatment related to the spine and spine care, including basic science and clinical investigations. It is a condition of publication that manuscripts submitted to The Spine Journal have not been published, and will not be simultaneously submitted or published elsewhere. The Spine Journal also publishes major reviews of specific topics by acknowledged authorities, technical notes, teaching editorials, and other special features. Letters to the Editor-in-Chief are encouraged.