Construction and validation of a machine learning model to predict the risk of nasopharyngeal carcinoma using multimodal clinical data: a single-center, retrospective study.
Authors: Xiao Li, Zuheng Wang, Wenting Chen, Chunmeng Wei, Wenhao Lu, Rongbin Zhou, Fubo Wang, Leifeng Liang
Journal: Clinical & Translational Oncology (impact factor 2.5; JCR Q2, Oncology)
Published: 2025-07-15
DOI: 10.1007/s12094-025-03992-0 (https://doi.org/10.1007/s12094-025-03992-0)
Abstract
Objective: Early detection and treatment of nasopharyngeal carcinoma (NPC) are critical for improving patient prognosis. The aim of this study is to develop and compare multiple machine learning (ML) models using multimodal clinical data to identify a predictive model for NPC risk, increase diagnostic accuracy, and guide personalized treatment strategies.
Methods: Clinical data were retrospectively collected from 1337 patients suspected of having NPC at the First People's Hospital of Yulin. Feature selection was performed using the least absolute shrinkage and selection operator (LASSO) regression. Patients were divided into training and test sets (80:20 ratio), and seven ML models were developed based on the training set. Model performance was assessed using metrics such as the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. The best-performing model was further evaluated through decision curve analysis (DCA), calibration, and learning curves. SHapley Additive exPlanations (SHAP) were used to interpret key clinical features.
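A minimal sketch of the workflow described in the Methods, assuming scikit-learn and a synthetic dataset in place of the real cohort (the patient data are not available here). The synthetic feature matrix, the 0.5 decision threshold, the hyperparameters, and the choice to fit LASSO on the training split only are illustrative assumptions, not the authors' implementation; only one of the seven model types (GBDT) is shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LassoCV
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort: 1337 rows and 53 candidate parameters.
# The real clinical features are NOT reproduced here.
X, y = make_classification(
    n_samples=1337, n_features=53, n_informative=17, random_state=0
)

# 80:20 train/test split, stratified on the outcome (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Standardize so the L1 penalty treats all features on the same scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# LASSO feature selection: keep features whose coefficient is non-zero.
# Fitting on the training split only avoids leaking test data (an assumption;
# the abstract does not state on which data selection was performed).
lasso = LassoCV(cv=5, random_state=0).fit(X_train_s, y_train)
selected = np.flatnonzero(lasso.coef_)

# Gradient boosting decision tree classifier on the selected features.
gbdt = GradientBoostingClassifier(random_state=0)
gbdt.fit(X_train_s[:, selected], y_train)

# Discrimination on the held-out split.
prob = gbdt.predict_proba(X_test_s[:, selected])[:, 1]
auc = roc_auc_score(y_test, prob)

# Sensitivity and specificity at an illustrative 0.5 probability threshold.
pred = (prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"selected features: {len(selected)} of {X.shape[1]}")
print(f"test AUC={auc:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Comparing several model families would repeat the fit-and-score step with other estimators on the same split and the same selected features.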
Results: Seven models were developed using 17 clinical features selected from 53 parameters. The gradient boosting decision tree (GBDT) model demonstrated superior performance (AUC of 0.95 in the training cohort and 0.82 in the validation cohort). Calibration curves and DCA confirmed the model's strong accuracy and clinical benefit. SHAP analysis revealed that age, lymphocyte percentage, serum albumin, sex, and EBV IgM were the five most significant predictors of NPC risk.
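Decision curve analysis summarizes clinical benefit as net benefit across threshold probabilities. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) · p_t/(1 − p_t), and compares a model against the "treat all" strategy. The labels and predicted probabilities are made-up toy values, and the function names are hypothetical; none of this comes from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data for illustration only (1 = NPC, 0 = no NPC).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02]

for pt in (0.1, 0.25, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(f"threshold={pt:.2f} model={model_nb:.3f} treat_all={all_nb:.3f}")
```

A model is considered useful at a given threshold when its net benefit exceeds both the "treat all" value and zero (the "treat none" strategy).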
Conclusion: The GBDT-based ML model, using multimodal clinical data, accurately identifies patients at high risk for NPC, providing a valuable tool for early screening and personalized treatment strategies.
About the journal:
Clinical and Translational Oncology is an international journal devoted to fostering interaction between experimental and clinical oncology. It covers all aspects of research on cancer, from the more basic discoveries dealing with both cell and molecular biology of tumour cells, to the most advanced clinical assays of conventional and new drugs. In addition, the journal has a strong commitment to facilitating the transfer of knowledge from the basic laboratory to the clinical practice, with the publication of educational series devoted to closing the gap between molecular and clinical oncologists. Molecular biology of tumours, identification of new targets for cancer therapy, and new technologies for research and treatment of cancer are the major themes covered by the educational series. Full research articles on a broad spectrum of subjects, including the molecular and cellular bases of disease, aetiology, pathophysiology, pathology, epidemiology, clinical features, and the diagnosis, prognosis and treatment of cancer, will be considered for publication.