Bias or Best Fit? A Comparative Analysis of the SEER and NCDB Data Sets in Single-model Machine Learning for Predicting Osteosarcoma Survival Outcomes.
Andrew G Girgis, Bishoy M Galoaa, Megan H Goh, Marcos R Gonzalez, Santiago A Lozano-Calderón
DOI: 10.1097/corr.0000000000003701
Journal: Clinical Orthopaedics and Related Research® (Q1, Orthopedics; impact factor 4.4)
Published: 2025-09-23 (journal article)
Citations: 0
Abstract
BACKGROUND
Machine-learning models are increasingly used in orthopaedic oncology to predict survival outcomes for patients with osteosarcoma. Typically, these models are trained on a single data set, such as the Surveillance, Epidemiology, and End Results (SEER) or the National Cancer Database (NCDB). However, because any single database, even if it is large, may emphasize different data points and may include errors, models trained on single data sets may learn database-specific patterns rather than generalizable clinical relationships, limiting their clinical utility when applied to different patient populations.
QUESTIONS/PURPOSES
We developed separate machine-learning models using SEER and NCDB databases and (1) compared the accuracy of SEER- and NCDB-trained models in estimating 2- and 5-year overall survival when validated on their respective databases, (2) assessed which database produced a more generalizable machine-learning model (defined as one that maintains high performance when applied to unseen external data) by using the model trained on one database to externally validate the other, and (3) identified key factors contributing to prediction accuracy.
METHODS
From 2000 to 2018 (SEER) and 2004 to 2018 (NCDB), we identified 15,241 SEER patients and 11,643 NCDB patients with osteosarcoma. After excluding patients with tumors outside the extremities/pelvis or with unconfirmed osteosarcoma histology (52% [7989] SEER, 22% [2537] NCDB), as well as those with missing metastasis, treatment, or prognosis data (20% [2974] SEER, 43% [5057] NCDB), we included 4049 patients from NCDB and 4278 patients from SEER, all with confirmed osteosarcoma. SEER provides population-based coverage with detailed staging but limited treatment information, while NCDB offers hospital-based data with comprehensive treatment details. We developed separate models for each data set, randomly splitting each into training (80%) and validation (20%) sets. This separation was crucial because it allowed us to test how well our models performed on completely new, unseen data, which indicates whether a model will work in real-world clinical practice. Primary outcomes included accuracy (proportion of correct predictions), area under the receiver operating characteristic curve (AUC; discriminative ability between survival outcomes, with values > 0.8 indicating good performance), Brier score (probabilistic prediction accuracy, with values < 0.25 indicating useful models), precision (proportion of positive predictions that were correct), recall (sensitivity for identifying actual outcomes), and F1 score (harmonic mean of precision and recall). The median patient age was 22 years in the NCDB versus 17 years in SEER (p = 0.005), with similar sex distributions (56% male in both) but different racial compositions and overall survival rates (72% and 52% at 2 and 5 years, respectively, for NCDB versus 65% and 43% for SEER).
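The split and metrics described above can be computed with standard tooling. The sketch below is a hedged, synthetic illustration, not the study's code: the feature matrix, the stand-in survival label, and the random-forest model are all illustrative assumptions, with scikit-learn standing in for whatever pipeline the authors used.

```python
# Hedged sketch, not the authors' code: computing the abstract's reported
# metrics with scikit-learn on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, roc_auc_score, brier_score_loss,
                             precision_score, recall_score, f1_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                          # stand-in clinical features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # stand-in 2-year survival label

# 80%/20% train/validation split, as described in METHODS
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)[:, 1]   # predicted probability of survival
pred = (proba >= 0.5).astype(int)         # hard prediction at the 0.5 threshold

acc = accuracy_score(y_va, pred)          # proportion of correct predictions
auc = roc_auc_score(y_va, proba)          # discrimination; > 0.8 is "good"
brier = brier_score_loss(y_va, proba)     # probabilistic accuracy; < 0.25 is "useful"
prec = precision_score(y_va, pred)        # correct positives / predicted positives
rec = recall_score(y_va, pred)            # sensitivity
f1 = f1_score(y_va, pred)                 # harmonic mean of precision and recall
print(f"acc={acc:.2f} AUC={auc:.2f} Brier={brier:.2f} "
      f"precision={prec:.2f} recall={rec:.2f} F1={f1:.2f}")
```

Note that AUC and Brier score are computed from the predicted probabilities, while accuracy, precision, recall, and F1 depend on the chosen classification threshold.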
RESULTS
Internal validation showed excellent performance: NCDB-trained models achieved an AUC of 0.93 (95% confidence interval [CI] 0.92 to 0.94) at 2 years and 0.91 (95% CI 0.90 to 0.92) at 5 years, while SEER-trained models achieved 0.90 (95% CI 0.89 to 0.91) and 0.92 (95% CI 0.91 to 0.92), respectively. These AUC values of 0.90 or higher indicate excellent discriminative ability; the models can reliably distinguish between patients who will survive and those who will not. The small differences between NCDB and SEER models (95% CI 0.90 to 0.93) are not clinically meaningful given the overlapping confidence intervals. However, external validation revealed poor transferability: NCDB models tested on SEER data achieved AUCs of 0.67 (95% CI 0.65 to 0.68) and 0.60 (95% CI 0.58 to 0.62) at 2 and 5 years, respectively, while SEER models tested on NCDB data achieved 0.61 (95% CI 0.59 to 0.62) and 0.56 (95% CI 0.55 to 0.58). These external validation AUC values < 0.70 indicate poor predictive performance, barely better than chance and unsuitable for clinical decision-making. This dramatic performance drop demonstrates that models cannot be reliably transferred between healthcare databases. NCDB models prioritized treatment variables, while SEER models emphasized demographic factors, reflecting the different clinical information each database captures and explaining why cross-database application fails.
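The internal-versus-external validation design behind these results can be sketched as follows. This is a hedged, synthetic illustration, not the study's code or data: two artificial cohorts whose outcomes depend on different feature columns, loosely mimicking the finding that NCDB models leaned on treatment variables while SEER models leaned on demographics. The qualitative internal-to-external AUC drop is the point, not the specific numbers.

```python
# Hedged sketch of cross-database external validation on synthetic cohorts.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_cohort(n, informative_col):
    """Synthetic cohort whose outcome hinges on one feature column."""
    X = rng.normal(size=(n, 8))
    y = (X[:, informative_col] + rng.normal(size=n) > 0).astype(int)
    return X, y

# "Database A" outcomes track feature 0 (think: treatment variables);
# "Database B" outcomes track feature 1 (think: demographic variables).
X_a, y_a = make_cohort(4000, informative_col=0)
X_b, y_b = make_cohort(4000, informative_col=1)

# Internal validation: train on A's training split, score A's held-out split.
X_tr, X_va, y_tr, y_va = train_test_split(X_a, y_a, test_size=0.2, random_state=0)
model_a = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
internal_auc = roc_auc_score(y_va, model_a.predict_proba(X_va)[:, 1])

# External validation: score the A-trained model on cohort B, where the
# learned feature-outcome relationship no longer holds.
external_auc = roc_auc_score(y_b, model_a.predict_proba(X_b)[:, 1])
print(f"internal AUC {internal_auc:.2f}, external AUC {external_auc:.2f}")
```

Because the model learned a relationship specific to cohort A, its external AUC collapses toward chance on cohort B, the same failure mode the study reports for cross-database transfer.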
CONCLUSION
Models should be validated within the same database environment where they will be applied. These results highlight differences between the NCDB and SEER data sets, showing that models learn database-specific patterns rather than generalizable disease patterns. Cross-database application of models leads to poor predictive performance and should be avoided without revalidation.
LEVEL OF EVIDENCE
Level III, prognostic study.
About the journal:
Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge.
CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.