Yu-Xuan Xiao, Yi-Xin Zou, Zhuo-Ying Li, Qiu-Ming Shen, Da-Ke Liu, Yu-Ting Tan, Hong-Lan Li, Yong-Bing Xiang
A machine learning approach for a 15-year prediction model of liver cancer incidence: Results from two large Chinese population cohorts
Annals of Epidemiology, Volume 112, Pages 28-37. Published 2025-10-16. DOI: 10.1016/j.annepidem.2025.10.015
https://www.sciencedirect.com/science/article/pii/S1047279725003126
Abstract
Background
Primary liver cancer (PLC) remains a major public health concern, particularly in China where the incidence is high. Existing prediction models often focus on high-risk populations and depend heavily on laboratory data, which limits their utility in general population screening.
Methods
We developed and validated a 15-year PLC risk prediction model using data from two large prospective cohort studies in Shanghai (n = 132,360), including 618 incident PLC cases. Candidate variables encompassed sociodemographic characteristics, lifestyle behaviors, medical history, and dietary factors. Predictor selection was performed using LASSO regression and the Boruta algorithm. Five machine learning models and logistic regression were compared. Model performance was evaluated using AUC, calibration plots and net reclassification improvement (NRI). SHapley Additive exPlanations (SHAP) were used to interpret model predictions. Web-based tools, including a simplified risk calculator, were developed to facilitate practical application.
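The two numeric metrics named above, AUC and NRI, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels and scores, and the 0.20 risk threshold are all assumptions made for illustration, and the NRI shown is the simple two-category form, which may differ from the variant used in the study.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen non-case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def categorical_nri(labels, old_scores, new_scores, threshold):
    """Two-category NRI: net share of cases moved up across the threshold
    plus net share of non-cases moved down across it."""
    up_event = down_event = up_non = down_non = 0
    for y, old, new in zip(labels, old_scores, new_scores):
        old_high, new_high = old >= threshold, new >= threshold
        if new_high and not old_high:      # reclassified upward
            if y == 1:
                up_event += 1
            else:
                up_non += 1
        elif old_high and not new_high:    # reclassified downward
            if y == 1:
                down_event += 1
            else:
                down_non += 1
    n_event = sum(labels)
    n_non = len(labels) - n_event
    nri_event = (up_event - down_event) / n_event
    nri_non = (down_non - up_non) / n_non
    return nri_event, nri_non, nri_event + nri_non


# Hypothetical toy data: 3 cases, 5 non-cases, two competing models.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
baseline_scores = [0.30, 0.10, 0.25, 0.22, 0.05, 0.08, 0.12, 0.02]
candidate_scores = [0.35, 0.21, 0.28, 0.15, 0.04, 0.06, 0.09, 0.03]

auc_baseline = auc(labels, baseline_scores)
auc_candidate = auc(labels, candidate_scores)
nri_event, nri_non, nri_total = categorical_nri(
    labels, baseline_scores, candidate_scores, threshold=0.20
)
```

The non-event component of the NRI is the part that captures improved classification of low-risk individuals: it is positive when the candidate model moves more non-cases below the threshold than above it.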
Results
LightGBM achieved the best discrimination (AUC = 0.766) and excellent calibration. Net reclassification analysis indicated an improved ability to correctly classify low-risk individuals. The model effectively stratified the population: the high-risk group had a 15-year PLC risk that was 39.56 times that of the low-risk group. SHAP analysis revealed biologically meaningful associations. A simplified logistic model with fewer variables also performed well (AUC = 0.762), supporting effective risk stratification.
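A between-group risk ratio like the one reported above can be computed roughly as follows. This is a sketch only, under stated assumptions: the cut points, group sizes, and case counts are invented toy values, not the study's data, and the abstract does not say how the risk groups were defined.

```python
def stratify(risks, outcomes, low_cut, high_cut):
    """Split people into low/medium/high groups by predicted risk and
    return the observed incidence (cases / people) in each group."""
    counts = {"low": [0, 0], "medium": [0, 0], "high": [0, 0]}
    for risk, outcome in zip(risks, outcomes):
        if risk < low_cut:
            key = "low"
        elif risk < high_cut:
            key = "medium"
        else:
            key = "high"
        counts[key][0] += outcome   # observed cases
        counts[key][1] += 1         # people in the group
    return {
        key: (cases / n if n else float("nan"))
        for key, (cases, n) in counts.items()
    }


def risk_ratio(incidence, numerator="high", denominator="low"):
    """Ratio of observed incidence between two risk groups."""
    return incidence[numerator] / incidence[denominator]


# Hypothetical toy cohort: predicted risks with observed 0/1 outcomes.
risks = [0.001] * 1000 + [0.01] * 500 + [0.05] * 200
outcomes = (
    [1] * 1 + [0] * 999      # low group: 1 case in 1000
    + [1] * 5 + [0] * 495    # medium group: 5 cases in 500
    + [1] * 6 + [0] * 194    # high group: 6 cases in 200
)

incidence = stratify(risks, outcomes, low_cut=0.005, high_cut=0.02)
ratio_high_vs_low = risk_ratio(incidence)
```

In this toy cohort the high-risk group's observed incidence is about 30 times that of the low-risk group. The paper's figure of 39.56 is the same kind of ratio, computed on the actual cohort data.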
Conclusion
We developed a questionnaire-based 15-year PLC risk prediction model applicable to the general Chinese population. Both the full and simplified models demonstrated strong performance and interpretability, making them valuable tools for large-scale screening and targeted prevention, especially in resource-limited settings.
About the journal:
The journal emphasizes the application of epidemiologic methods to issues that affect the distribution and determinants of human illness in diverse contexts. Its primary focus is on chronic and acute conditions of diverse etiologies and of major importance to clinical medicine, public health, and health care delivery.