Daniël C de Reus, R Harmen Kuijten, Priyanshu Saha, Diego A Abelleyra Lastoria, Aliénor Warr-Esser, Charles F C Taylor, Olivier Q Groot, Darren Lui, Jorrit-Jan Verlaan, Daniel G Tobert
Spine Journal. Journal Article. Published 2025-03-27. DOI: 10.1016/j.spinee.2025.03.018 (https://doi.org/10.1016/j.spinee.2025.03.018)
Citations: 0
Abstract
External validation of a machine learning prediction model for massive blood loss during surgery for spinal metastases: a multi-institutional study using 880 patients.
Background context: A machine learning (ML) model was recently developed to predict massive intraoperative blood loss (>2,500 mL) during posterior decompressive surgery for spinal metastasis. The model performed well on external validation within the same region in China.
Purpose: We sought to externally validate this model across new geographic regions (North America and Europe) and patient cohorts.
Study design: Multi-institutional retrospective cohort study.
Patient sample: We retrospectively included patients 18 years or older who underwent decompressive surgery for spinal metastasis at three institutions in the United States, the United Kingdom, and the Netherlands between 2016 and 2022. Inclusion and exclusion criteria were consistent with the development study, with the additional inclusion of (1) patients undergoing palliative decompression without stabilization, (2) patients with multiple myeloma and lymphoma, and (3) patients who continued anticoagulants perioperatively.
Outcome measures: Model performance was assessed by comparing the incidence of massive intraoperative blood loss (>2,500 mL) in our cohort to the predicted risk generated by the ML model. Blood loss was quantified in 7 ways (including the formula from the development study) because no gold standard exists and the method in the development paper was not clearly defined. We estimated blood loss from the anesthesia report and calculated it from transfusion data and preoperative and postoperative hematocrit levels.
Methods: The five input variables required for risk calculation by the ML model were collected manually: tumor type, smoking status, ECOG score, surgical process, and preoperative platelet count. Model performance was assessed on overall fit (Brier score), discriminatory ability (area under the curve (AUC)), calibration (intercept and slope), and clinical utility (decision curve analysis (DCA)) for the total validation cohort, and for the North American and European cohorts separately. A sub-analysis excluding the additionally included patient groups assessed the model's performance under the same inclusion and exclusion criteria as the development cohort.
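The performance measures named above (Brier score, AUC, calibration intercept and slope) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' analysis code: the function names and the toy data are hypothetical, the AUC uses the pairwise (Mann-Whitney) definition, and the calibration parameters are assumed to come from the usual logistic recalibration model fitted by Newton-Raphson, without the confidence intervals reported in the study.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def brier_score(y, p):
    """Mean squared difference between predicted risk and observed outcome (0/1)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Share of event/non-event pairs in which the event has the higher
    predicted risk; ties count as one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def calibration_intercept(y, p, iters=50):
    """Calibration-in-the-large: fit y ~ a + logit(p) with the slope fixed at 1.
    A negative value means the model overestimates risk on average."""
    x = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(iters):
        mu = [sigmoid(a + xi) for xi in x]
        grad = sum(yi - mi for yi, mi in zip(y, mu))
        hess = sum(mi * (1 - mi) for mi in mu)
        a += grad / hess
    return a

def calibration_slope(y, p, iters=50):
    """Fit y ~ a + b * logit(p) and return (a, b); b is the calibration slope.
    Assumes outcomes are not perfectly separated by the predictions."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iters):
        mu = [sigmoid(a + b * xi) for xi in x]
        w = [mi * (1 - mi) for mi in mu]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical toy data: observed outcomes and predicted risks.
y_toy = [0, 0, 1, 0, 0, 1, 0, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(brier_score(y_toy, p_toy), auc(y_toy, p_toy))
print(calibration_intercept(y_toy, p_toy), calibration_slope(y_toy, p_toy))
```

A well-calibrated model has an intercept near 0 and a slope near 1. A slope well below 1 indicates predictions that are too extreme, which is a common sign of overfitting.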
Results: A total of 880 patients were included. The incidence of massive blood loss ranged from 5.3% to 18%, depending on the quantification method used. Using the most favorable quantification method, the predictive model overestimated risk in our total validation cohort and scored poorly on overall fit (Brier score: 0.278), discrimination (AUC: 0.631 [95% CI: 0.583, 0.680]), calibration (intercept: -2.082 [95% CI: -2.285, -1.879]; slope: 0.283 [95% CI: 0.173, 0.393]), and clinical utility, with net harm observed in decision curve analysis from a threshold probability of 20%. Similarly poor performance was observed in the sub-analysis excluding the additionally included patients (n=676) and when analyzing the North American (n=539) and European (n=341) cohorts separately.
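The notion of net harm in decision curve analysis can be made concrete with the standard net benefit formula, NB = TP/n - (FP/n) * pt/(1 - pt), where pt is the threshold probability. The sketch below is illustrative only: the function names and the five-patient toy data are hypothetical and are not drawn from the study. A negative net benefit means that acting on the model's predictions is worse than treating no one as high risk.

```python
def net_benefit(y, p, threshold):
    """Net benefit of flagging every patient whose predicted risk is at or
    above the threshold probability."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y, threshold):
    """Reference strategy: flag every patient regardless of predicted risk."""
    prevalence = sum(y) / len(y)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical example of a model that overestimates risk:
# four of five patients are flagged at a 50% threshold, but only one has the event.
y_toy = [0, 0, 0, 0, 1]
p_toy = [0.6, 0.7, 0.3, 0.8, 0.9]
nb = net_benefit(y_toy, p_toy, threshold=0.5)
print("net harm" if nb < 0 else "net benefit", nb)
```

The "treat none" strategy always has a net benefit of 0, so a full decision curve plots `net_benefit` against a range of thresholds alongside the treat-all and treat-none references.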
Conclusions: To our knowledge, this is the first published external validation of a predictive ML model within orthopedic surgery to demonstrate poor performance. This poor performance might be attributed to overfitting and sampling bias, as the development cohort had an insufficient sample size, and to distributional shift, as our cohort differed in key predictive variables used by the model. These findings emphasize the importance of extensive validation in different geographical areas and of addressing biases and known pitfalls of ML model development before clinical implementation, as untested models may do more harm than good.
About the journal:
The Spine Journal, the official journal of the North American Spine Society, is an international and multidisciplinary journal that publishes original, peer-reviewed articles on research and treatment related to the spine and spine care, including basic science and clinical investigations. It is a condition of publication that manuscripts submitted to The Spine Journal have not been published, and will not be simultaneously submitted or published elsewhere. The Spine Journal also publishes major reviews of specific topics by acknowledged authorities, technical notes, teaching editorials, and other special features. Letters to the Editor-in-Chief are encouraged.