Bias or Best Fit? A Comparative Analysis of the SEER and NCDB Data Sets in Single-model Machine Learning for Predicting Osteosarcoma Survival Outcomes.

IF 4.4 | Region 2, Medicine | Q1 ORTHOPEDICS
Andrew G Girgis, Bishoy M Galoaa, Megan H Goh, Marcos R Gonzalez, Santiago A Lozano-Calderón
{"title":"偏见还是最适合?单模型机器学习预测骨肉瘤生存结果SEER和NCDB数据集的比较分析","authors":"Andrew G Girgis,Bishoy M Galoaa,Megan H Goh,Marcos R Gonzalez,Santiago A Lozano-Calderón","doi":"10.1097/corr.0000000000003701","DOIUrl":null,"url":null,"abstract":"BACKGROUND\r\nMachine-learning models are increasingly used in orthopaedic oncology to predict survival outcomes for patients with osteosarcoma. Typically, these models are trained on a single data set, such as the Surveillance, Epidemiology, and End Results (SEER) or the National Cancer Database (NCDB). However, because any single database, even if it is large, may emphasize different data points and may include errors, models trained on single data sets may learn database-specific patterns rather than generalizable clinical relationships, limiting their clinical utility when applied to different patient populations.\r\n\r\nQUESTIONS/PURPOSES\r\nWe developed separate machine-learning models using SEER and NCDB databases and (1) compared the accuracy of SEER- and NCDB-trained models in estimating 2- and 5-year overall survival when validated on their respective databases, (2) assessed which database produced a more generalizable machine-learning model (defined as one that maintains high performance when applied to unseen external data) by using the model trained on one database to externally validate the other, and (3) identified key factors contributing to prediction accuracy.\r\n\r\nMETHODS\r\nFrom 2000 to 2018 (SEER) and 2004 to 2018 (NCDB), we identified 15,241 SEER patients and 11,643 NCDB patients with osteosarcoma. After excluding patients with tumors outside the extremities/pelvis, including unconfirmed osteosarcoma histology results (52% [7989] SEER, 22% [2537] NCDB) and those with missing metastasis, treatment, or prognosis data (20% [2974] SEER, 43% [5057] NCDB), we included 4049 patients from NCDB and 4278 patients from SEER, all with confirmed osteosarcoma. SEER provides population-based coverage with detailed staging but limited treatment information, while NCDB offers hospital-based data with comprehensive treatment details. We developed separate models for each data set, randomly splitting each into training (80%) and validation (20%) sets. This separation was crucial because it allowed us to test how well our models performed on completely new, unseen data-to test whether a model will work in real-world clinical practice. Primary outcomes included accuracy (proportion of correct predictions), area under the receiver operating characteristic curve (AUC) (discriminative ability between survival outcomes, with values > 0.8 indicating good performance), Brier score (probabilistic prediction accuracy, with values < 0.25 indicating useful models), precision (proportion of positive predictions that were correct), recall (sensitivity for identifying actual outcomes), and F1 score (harmonic mean of precision and recall). The median patient age was 22 years in the NCDB versus 17 years in SEER (p = 0.005), with similar sex distributions (56% male in NCDB, 56% male in SEER) but different racial compositions and overall survival rates (72% and 52% at 2 and 5 years, respectively, for NCDB versus 65% and 43% for SEER).\r\n\r\nRESULTS\r\nInternal validation showed excellent performance: NCDB-trained models achieved an AUC of 0.93 (95% confidence interval [CI] 0.92 to 0.94) at 2 years and 0.91 (95% CI 0.90 to 0.92) at 5 years, while SEER-trained models achieved 0.90 (95% CI 0.89 to 0.91) and 0.92 (95% CI 0.91 to 0.92), respectively. 
These AUC values > 0.90 indicate excellent discriminative ability; the models can reliably distinguish between patients who will survive and those who will not. The small differences between NCDB and SEER models (95% CI 0.90 to 0.93) are not clinically meaningful given overlapping confidence intervals. However, external validation revealed poor transferability: NCDB models tested on SEER data achieved an AUC of 0.67 (95% CI 0.65 to 0.68) and 0.60 (95% CI 0.58 to 0.62), while SEER models tested on NCDB data achieved 0.61 (95% CI 0.59 to 0.62) and 0.56 (95% CI 0.55 to 0.58). These external validation AUC values < 0.70 indicate poor predictive performance (barely better than chance and unsuitable for clinical decision-making). This dramatic performance drop demonstrates that models cannot be reliably transferred between different healthcare databases. NCDB models prioritized treatment variables, while SEER models emphasized demographic factors, reflecting the different clinical information captured by each database and explaining why cross-database application fails.\r\n\r\nCONCLUSION\r\nModels should be validated within the same database environment where they will be applied. These results highlight differences between the NCDB and SEER data sets, showing that models learn database-specific patterns rather than generalizable disease patterns. Cross-database application of models leads to poor predictive performance and should be avoided without revalidation.\r\n\r\nLEVEL OF EVIDENCE\r\nLevel III, prognostic study.","PeriodicalId":10404,"journal":{"name":"Clinical Orthopaedics and Related Research®","volume":"97 8 1","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bias or Best Fit? A Comparative Analysis of the SEER and NCDB Data Sets in Single-model Machine Learning for Predicting Osteosarcoma Survival Outcomes.\",\"authors\":\"Andrew G Girgis,Bishoy M Galoaa,Megan H Goh,Marcos R Gonzalez,Santiago A Lozano-Calderón\",\"doi\":\"10.1097/corr.0000000000003701\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"BACKGROUND\\r\\nMachine-learning models are increasingly used in orthopaedic oncology to predict survival outcomes for patients with osteosarcoma. Typically, these models are trained on a single data set, such as the Surveillance, Epidemiology, and End Results (SEER) or the National Cancer Database (NCDB). However, because any single database, even if it is large, may emphasize different data points and may include errors, models trained on single data sets may learn database-specific patterns rather than generalizable clinical relationships, limiting their clinical utility when applied to different patient populations.\\r\\n\\r\\nQUESTIONS/PURPOSES\\r\\nWe developed separate machine-learning models using SEER and NCDB databases and (1) compared the accuracy of SEER- and NCDB-trained models in estimating 2- and 5-year overall survival when validated on their respective databases, (2) assessed which database produced a more generalizable machine-learning model (defined as one that maintains high performance when applied to unseen external data) by using the model trained on one database to externally validate the other, and (3) identified key factors contributing to prediction accuracy.\\r\\n\\r\\nMETHODS\\r\\nFrom 2000 to 2018 (SEER) and 2004 to 2018 (NCDB), we identified 15,241 SEER patients and 11,643 NCDB patients with osteosarcoma. 
After excluding patients with tumors outside the extremities/pelvis, including unconfirmed osteosarcoma histology results (52% [7989] SEER, 22% [2537] NCDB) and those with missing metastasis, treatment, or prognosis data (20% [2974] SEER, 43% [5057] NCDB), we included 4049 patients from NCDB and 4278 patients from SEER, all with confirmed osteosarcoma. SEER provides population-based coverage with detailed staging but limited treatment information, while NCDB offers hospital-based data with comprehensive treatment details. We developed separate models for each data set, randomly splitting each into training (80%) and validation (20%) sets. This separation was crucial because it allowed us to test how well our models performed on completely new, unseen data-to test whether a model will work in real-world clinical practice. Primary outcomes included accuracy (proportion of correct predictions), area under the receiver operating characteristic curve (AUC) (discriminative ability between survival outcomes, with values > 0.8 indicating good performance), Brier score (probabilistic prediction accuracy, with values < 0.25 indicating useful models), precision (proportion of positive predictions that were correct), recall (sensitivity for identifying actual outcomes), and F1 score (harmonic mean of precision and recall). The median patient age was 22 years in the NCDB versus 17 years in SEER (p = 0.005), with similar sex distributions (56% male in NCDB, 56% male in SEER) but different racial compositions and overall survival rates (72% and 52% at 2 and 5 years, respectively, for NCDB versus 65% and 43% for SEER).\\r\\n\\r\\nRESULTS\\r\\nInternal validation showed excellent performance: NCDB-trained models achieved an AUC of 0.93 (95% confidence interval [CI] 0.92 to 0.94) at 2 years and 0.91 (95% CI 0.90 to 0.92) at 5 years, while SEER-trained models achieved 0.90 (95% CI 0.89 to 0.91) and 0.92 (95% CI 0.91 to 0.92), respectively. These AUC values > 0.90 indicate excellent discriminative ability; the models can reliably distinguish between patients who will survive and those who will not. The small differences between NCDB and SEER models (95% CI 0.90 to 0.93) are not clinically meaningful given overlapping confidence intervals. However, external validation revealed poor transferability: NCDB models tested on SEER data achieved an AUC of 0.67 (95% CI 0.65 to 0.68) and 0.60 (95% CI 0.58 to 0.62), while SEER models tested on NCDB data achieved 0.61 (95% CI 0.59 to 0.62) and 0.56 (95% CI 0.55 to 0.58). These external validation AUC values < 0.70 indicate poor predictive performance (barely better than chance and unsuitable for clinical decision-making). This dramatic performance drop demonstrates that models cannot be reliably transferred between different healthcare databases. NCDB models prioritized treatment variables, while SEER models emphasized demographic factors, reflecting the different clinical information captured by each database and explaining why cross-database application fails.\\r\\n\\r\\nCONCLUSION\\r\\nModels should be validated within the same database environment where they will be applied. These results highlight differences between the NCDB and SEER data sets, showing that models learn database-specific patterns rather than generalizable disease patterns. 
Cross-database application of models leads to poor predictive performance and should be avoided without revalidation.\\r\\n\\r\\nLEVEL OF EVIDENCE\\r\\nLevel III, prognostic study.\",\"PeriodicalId\":10404,\"journal\":{\"name\":\"Clinical Orthopaedics and Related Research®\",\"volume\":\"97 8 1\",\"pages\":\"\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Clinical Orthopaedics and Related Research®\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/corr.0000000000003701\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Orthopaedics and Related Research®","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/corr.0000000000003701","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0

Abstract

BACKGROUND
Machine-learning models are increasingly used in orthopaedic oncology to predict survival outcomes for patients with osteosarcoma. Typically, these models are trained on a single data set, such as the Surveillance, Epidemiology, and End Results (SEER) or the National Cancer Database (NCDB). However, because any single database, even if it is large, may emphasize different data points and may include errors, models trained on single data sets may learn database-specific patterns rather than generalizable clinical relationships, limiting their clinical utility when applied to different patient populations.

QUESTIONS/PURPOSES
We developed separate machine-learning models using SEER and NCDB databases and (1) compared the accuracy of SEER- and NCDB-trained models in estimating 2- and 5-year overall survival when validated on their respective databases, (2) assessed which database produced a more generalizable machine-learning model (defined as one that maintains high performance when applied to unseen external data) by using the model trained on one database to externally validate the other, and (3) identified key factors contributing to prediction accuracy.

METHODS
From 2000 to 2018 (SEER) and 2004 to 2018 (NCDB), we identified 15,241 SEER patients and 11,643 NCDB patients with osteosarcoma. After excluding patients with tumors outside the extremities/pelvis, including unconfirmed osteosarcoma histology results (52% [7989] SEER, 22% [2537] NCDB) and those with missing metastasis, treatment, or prognosis data (20% [2974] SEER, 43% [5057] NCDB), we included 4049 patients from NCDB and 4278 patients from SEER, all with confirmed osteosarcoma. SEER provides population-based coverage with detailed staging but limited treatment information, while NCDB offers hospital-based data with comprehensive treatment details. We developed separate models for each data set, randomly splitting each into training (80%) and validation (20%) sets. This separation was crucial because it allowed us to test how well our models performed on completely new, unseen data, that is, to test whether a model will work in real-world clinical practice. Primary outcomes included accuracy (proportion of correct predictions), area under the receiver operating characteristic curve (AUC) (discriminative ability between survival outcomes, with values > 0.8 indicating good performance), Brier score (probabilistic prediction accuracy, with values < 0.25 indicating useful models), precision (proportion of positive predictions that were correct), recall (sensitivity for identifying actual outcomes), and F1 score (harmonic mean of precision and recall). The median patient age was 22 years in the NCDB versus 17 years in SEER (p = 0.005), with similar sex distributions (56% male in NCDB, 56% male in SEER) but different racial compositions and overall survival rates (72% and 52% at 2 and 5 years, respectively, for NCDB versus 65% and 43% for SEER).

RESULTS
Internal validation showed excellent performance: NCDB-trained models achieved an AUC of 0.93 (95% confidence interval [CI] 0.92 to 0.94) at 2 years and 0.91 (95% CI 0.90 to 0.92) at 5 years, while SEER-trained models achieved 0.90 (95% CI 0.89 to 0.91) and 0.92 (95% CI 0.91 to 0.92), respectively. These AUC values > 0.90 indicate excellent discriminative ability; the models can reliably distinguish between patients who will survive and those who will not. The small differences between NCDB and SEER models (95% CI 0.90 to 0.93) are not clinically meaningful given overlapping confidence intervals. However, external validation revealed poor transferability: NCDB models tested on SEER data achieved an AUC of 0.67 (95% CI 0.65 to 0.68) and 0.60 (95% CI 0.58 to 0.62), while SEER models tested on NCDB data achieved 0.61 (95% CI 0.59 to 0.62) and 0.56 (95% CI 0.55 to 0.58). These external validation AUC values < 0.70 indicate poor predictive performance (barely better than chance and unsuitable for clinical decision-making). This dramatic performance drop demonstrates that models cannot be reliably transferred between different healthcare databases. NCDB models prioritized treatment variables, while SEER models emphasized demographic factors, reflecting the different clinical information captured by each database and explaining why cross-database application fails.

CONCLUSION
Models should be validated within the same database environment where they will be applied. These results highlight differences between the NCDB and SEER data sets, showing that models learn database-specific patterns rather than generalizable disease patterns. Cross-database application of models leads to poor predictive performance and should be avoided without revalidation.

LEVEL OF EVIDENCE
Level III, prognostic study.
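To make the validation workflow described in METHODS and RESULTS concrete, the following is a minimal Python sketch: an 80/20 split within one database for internal validation, then the same fitted model applied unchanged to the other database for external validation, scored with AUC, Brier score, precision, recall, and F1. This is an illustration under stated assumptions, not the authors' code: the gradient-boosting classifier, the `seer_df`/`ncdb_df` DataFrames, and the feature and label column names are all hypothetical.

```python
# Sketch of the internal/external validation design reported in the abstract.
# Assumes two preprocessed DataFrames, `seer_df` and `ncdb_df`, sharing a common
# feature set and a binary label column (e.g., "survived_2yr"); all names are
# illustrative, not from the published study.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, brier_score_loss, precision_score, recall_score, f1_score,
)


def evaluate(model, X, y):
    """Compute the metrics used as primary outcomes in the study."""
    prob = model.predict_proba(X)[:, 1]
    pred = (prob >= 0.5).astype(int)
    return {
        "AUC": roc_auc_score(y, prob),          # > 0.8 considered good
        "Brier": brier_score_loss(y, prob),     # < 0.25 considered useful
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "F1": f1_score(y, pred),
    }


def fit_and_validate(train_df, external_df, features, label="survived_2yr"):
    # Internal validation: random 80/20 split within the training database.
    X_tr, X_val, y_tr, y_val = train_test_split(
        train_df[features], train_df[label], test_size=0.2, random_state=42
    )
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    internal = evaluate(model, X_val, y_val)
    # External validation: apply the same fitted model to the other database.
    external = evaluate(model, external_df[features], external_df[label])
    return internal, external


# Hypothetical usage: train on NCDB and externally validate on SEER, and vice versa.
# features = ["age", "tumor_size", "metastasis", "chemotherapy", "surgery"]
# ncdb_internal, ncdb_on_seer = fit_and_validate(ncdb_df, seer_df, features)
# seer_internal, seer_on_ncdb = fit_and_validate(seer_df, ncdb_df, features)
```

Under this design, the study's central finding corresponds to the `internal` metrics reaching AUC values near 0.9 while the `external` metrics fall below 0.7 in both directions, which is the performance drop the authors interpret as database-specific learning.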
Source journal: Clinical Orthopaedics and Related Research®
CiteScore: 7.00
Self-citation rate: 11.90%
Articles published: 722
Review time: 2.5 months
Journal introduction: Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge. CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.