Title: Machine learning for predicting the prognosis of patients with thymoma and thymic carcinoma
Authors: Haijie Xu, Xirui Lin, Junhan Wu, Jianrong Chen, Jiaying Wu, Zheng Lin, Xiaoming Cai, Jiong Lin, Peishen Li, Chaoquan He, Zefeng Xie, Hansheng Wu
Journal: Journal of Thoracic Disease, 17(2), 824-835 (impact factor 2.1; JCR Q3, Respiratory System)
Published: 2025-02-28 (Epub 2025-02-20)
DOI: 10.21037/jtd-24-1263 (https://doi.org/10.21037/jtd-24-1263)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11898343/pdf/
Citations: 0
Abstract
Background: Thymoma and thymic carcinoma are the most common tumors of the anterior mediastinum. However, there is little research on applying machine learning (ML) approaches to the prognostic prediction of thymoma and thymic carcinoma. This study aims to develop predictive models utilizing ML techniques to accurately forecast the 5-year survival of patients with thymoma and thymic carcinoma.
Methods: Patients with malignant thymic neoplasms were identified in the Surveillance, Epidemiology, and End Results (SEER) 17 database, and their demographic and clinicopathological characteristics were collected. ML classifiers, including elastic net regularized logistic regression, random forest (RF), non-linear support vector machine (SVM), extreme gradient boosting (XGBoost) machine, and categorical boosting (CatBoost), were trained. The hyperparameters of the algorithms were optimized by grid search with five repeats of 10-fold cross-validation. Ensemble models were built from the three algorithms with the highest area under the receiver operating characteristic (ROC) curve (AUC) in the validation set. The best model among the single models and the ensemble model was selected as the final model. Calibration curves and decision curves were used to evaluate calibration performance and clinical utility. For comparison, we constructed a baseline logistic regression model consisting of age and Masaoka stage.
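The resampling and ensembling steps described in the Methods can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and averaging predicted probabilities is an assumed way to combine the three best models, since the abstract does not state how the ensemble was formed.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_splits=10, n_repeats=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold keeps the class ratio.
            for pos, i in enumerate(shuffled):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            valid = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train, valid


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top3_average_ensemble(labels, probs_by_model):
    """Average the probabilities of the three models with the highest AUC.

    Assumption: simple averaging; the source does not specify the combiner.
    """
    ranked = sorted(probs_by_model,
                    key=lambda m: auc(labels, probs_by_model[m]),
                    reverse=True)
    chosen = ranked[:3]
    averaged = [sum(probs_by_model[m][i] for m in chosen) / len(chosen)
                for i in range(len(labels))]
    return chosen, averaged
```

With the default arguments, `repeated_stratified_folds` produces 5 x 10 = 50 train/validation splits; a grid search would score each hyperparameter setting by its mean validation AUC over those splits.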
Results: After data cleaning, 1,363 patients and 841 patients were included in the overall survival (OS) dataset and disease-specific survival (DSS) dataset, respectively. CatBoost [AUC: 0.755; 95% confidence interval (CI): 0.698-0.811] had the best performance in the OS prediction for the original dataset. The ensemble model achieved the highest prognostic efficiency for the original dataset, with an AUC of 0.833 (95% CI: 0.765-0.901). Calibration showed favorable goodness of fit and was further verified with the Hosmer-Lemeshow test (CatBoost: χ²=12.63, P=0.13; ensemble model: χ²=7.61, P=0.47). The decision curve showed that the final model provided a high net benefit. The model could significantly distinguish the prognosis of patients (all P values <0.001). Finally, World Health Organization (WHO) histological classification, Masaoka stage, and age were the variables that significantly contributed to the models' prediction of OS and DSS.
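The calibration and decision-curve quantities reported above can be illustrated with a short stdlib-only sketch. The function names are hypothetical, and the use of 10 risk groups (hence 8 degrees of freedom) is an assumption: the abstract does not state the grouping, although 8 degrees of freedom is consistent with the reported P values.

```python
import math


def hosmer_lemeshow_statistic(labels, probs, n_groups=10):
    """Hosmer-Lemeshow chi-square over groups sorted by predicted risk."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    stat = 0.0
    for g in range(n_groups):
        lo = g * len(order) // n_groups
        hi = (g + 1) * len(order) // n_groups
        group = order[lo:hi]
        n = len(group)
        if n == 0:
            continue
        observed = sum(labels[i] for i in group)
        expected = sum(probs[i] for i in group)
        mean_p = expected / n
        # Skip degenerate groups where the variance term would be zero.
        if 0.0 < mean_p < 1.0:
            stat += (observed - expected) ** 2 / (n * mean_p * (1.0 - mean_p))
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2 != 0 or df <= 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


def net_benefit(labels, probs, threshold):
    """Decision-curve net benefit at one threshold probability."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)
```

Under the 8-degrees-of-freedom assumption, `chi2_sf_even_df(12.63, 8)` and `chi2_sf_even_df(7.61, 8)` evaluate to roughly 0.125 and 0.472, which round to the reported P=0.13 and P=0.47. A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing it with the treat-all and treat-none strategies.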
Conclusions: We trained ML-based predictive models that could accurately predict the 5-year OS and DSS of patients with thymoma and thymic carcinoma.
About the journal:
The Journal of Thoracic Disease (JTD, J Thorac Dis, pISSN: 2072-1439; eISSN: 2077-6624) was founded in Dec 2009 and was indexed in PubMed in Dec 2011 and in the Science Citation Index (SCI) in Feb 2013. It was published quarterly (Dec 2009-Dec 2011) and bimonthly (Jan 2012-Dec 2013), has been published monthly since Jan 2014, and is openly distributed worldwide. JTD received an impact factor of 2.365 for the year 2016. JTD publishes manuscripts that describe new findings and provide current, practical information on the diagnosis and treatment of conditions related to thoracic disease. All submission and review are conducted electronically to ensure rapid review.