Assessment and recalibration of seventeen lung cancer risk prediction models in approximately one million Chinese population utilising healthcare big data: a retrospective cohort analysis
Ziqing Ye, Yexiang Sun, Yueqi Yin, Liya Liu, Miao Cui, Longyao Zhang, Yuantao Hao, David C. Christiani, Hongbo Lin, Peng Shen, Yongyue Wei
The Lancet Regional Health: Western Pacific, volume 58, Article 101575. Published 2025-05-01. DOI: 10.1016/j.lanwpc.2025.101575. https://www.sciencedirect.com/science/article/pii/S2666606525001129
Abstract
Background
Many lung cancer risk prediction models have been developed worldwide, but few validation studies have been conducted specifically in Chinese populations. This study evaluated the feasibility and performance of 17 lung cancer risk prediction models from around the world when applied to Chinese healthcare big data.
Methods
The study included individuals whose information was recorded in the Yinzhou Regional Health Care Database (YRHCD) between January 1, 2010 and December 31, 2021. Seventeen lung cancer risk prediction models were evaluated in the overall population and in subgroups stratified by age and sex: Bach; the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial 2012 model (PLCOm2012); Korean Men; PLCOall2014; the Pittsburgh Predictor; the Liverpool Lung Project Risk Prediction Model for Lung Cancer Incidence (LLPi); the Lung Cancer Risk Assessment Tool (LCRAT); Constrained LCRAT; the Nord-Trøndelag Health Study model (HUNT); the Japan Public Health Center-based study model (JPHC); Reduced HUNT; PLCOm2012 without race information (PLCOm2012-norace); the Liverpool Lung Project version 3 (LLPv3); the Lung Cancer Risk Score (LCRS); the Optimized Early Warning Model for Lung Cancer Risk (OWL); the University College London-Incidence model (UCL-I); and the Shanghai Lung Cancer incidence Model (Shanghai-LCM). Discrimination was assessed using Harrell's C-index and the time-dependent area under the curve (AUC). Calibration was evaluated using the expected-to-observed ratio (EOR) and calibration curves. The models were then recalibrated in the Yinzhou population, and the calibration of the recalibrated models was evaluated. For each model, before and after recalibration, we redefined the risk threshold so that it would select the same number of individuals as would be selected under the China National Lung Cancer Screening Guideline with Low-dose Computed Tomography 2023 Version (CNLCS 2023). The Kaplan–Meier method was used to estimate the incidence and number of lung cancer cases over a five-year follow-up period among individuals selected by the different criteria or models, and Kaplan–Meier survival curves were plotted.
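The discrimination and calibration measures named above, and the idea of matching a model's risk threshold to a fixed number of selected individuals, can be illustrated with a short Python sketch. This is a minimal sketch on toy data, not the authors' code: the function names and the toy arrays are hypothetical, and the EOR here divides by a raw count of observed cases, which is an assumed simplification that ignores censoring.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the subject who
    fails earlier also has the higher predicted risk (risk ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the two times differ and the
        # shorter time ends in an observed event (not censoring).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return concordant / comparable


def expected_to_observed_ratio(predicted_risks, events):
    """EOR: sum of predicted probabilities divided by observed cases.
    Above 1 suggests overestimation, below 1 underestimation.
    Simplification (assumption): censoring is ignored."""
    return sum(predicted_risks) / sum(events)


def matched_threshold(predicted_risks, n_selected):
    """Risk cut-off at the n_selected-th highest predicted risk, so that
    `risk >= cut-off` selects n_selected people when there are no ties."""
    return sorted(predicted_risks, reverse=True)[n_selected - 1]


if __name__ == "__main__":
    # Hypothetical toy cohort: follow-up years, event flags, predicted risks.
    toy_times = [2, 4, 5, 6, 8]
    toy_events = [1, 0, 1, 0, 1]
    toy_risks = [0.9, 0.5, 0.4, 0.8, 0.7]
    print("C-index:", harrell_c_index(toy_times, toy_events, toy_risks))
    print("EOR:", expected_to_observed_ratio(toy_risks, toy_events))
    print("Threshold for top 2:", matched_threshold(toy_risks, 2))
```

Fixing the number selected, rather than the risk cut-off, lets each model be compared with the guideline criteria at the same screening workload.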
Findings
A total of 904,667 study participants were included in the analysis, comprising 66,730 ever smokers and 837,937 never smokers. Of the 17 models initially considered, only six (Bach, Pittsburgh Predictor, JPHC, Reduced HUNT, Constrained LCRAT, UCL-I) had complete information on all predictor variables available in the YRHCD. Most models showed similar levels of discrimination, with C-indices ranging from 0.78 (95% CI 0.74–0.82) to 0.88 (0.87–0.89) and time-dependent AUCs ranging from 0.74 (95% CI 0.73–0.75) to 0.88 (0.87–0.89). Most models overestimated incidence risk among ever smokers, with EORs ranging from 1.10 (95% CI 1.02–1.19) to 4.37 (4.16–4.58). Among never smokers, most models underestimated risk, with EORs ranging from 0.12 (95% CI 0.11–0.14) to 1.30 (1.26–1.35); a few models were exceptions. After recalibration, all models showed improved accuracy of predicted probabilities. The five-year incidence rates observed in the model-selected populations, ranging from 0.81% (95% CI 0.64%–0.96%) to 1.29% (1.08%–1.48%), were consistently higher than the rate observed in the criteria-selected population (0.75%, 95% CI 0.59%–0.90%). Following recalibration, the five-year incidence rates in the model-selected populations improved, ranging from 0.81% (95% CI 0.64%–0.96%) to 1.60% (1.36%–1.82%).
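The five-year incidence rates reported above are Kaplan–Meier estimates. As a rough illustration of how such a figure is obtained from censored follow-up data, the Python sketch below computes the cumulative incidence 1 − S(t) at a chosen horizon. The function name and the toy cohort are hypothetical, confidence intervals are omitted, and this is not the authors' implementation.

```python
def kaplan_meier_incidence(times, events, horizon):
    """Cumulative incidence 1 - S(horizon) from the Kaplan-Meier estimator.

    times:   follow-up time for each individual (e.g. in years)
    events:  1 if lung cancer was diagnosed at that time, 0 if censored
    horizon: time point at which to report incidence (e.g. 5 years)
    """
    survival = 1.0
    at_risk = len(times)
    # Walk through the distinct follow-up times in increasing order.
    for t in sorted(set(times)):
        if t > horizon:
            break
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Multiply by the conditional probability of remaining event-free.
            survival *= 1.0 - n_events / at_risk
        # Everyone observed at time t (event or censored) leaves the risk set.
        at_risk -= n_leaving
    return 1.0 - survival


if __name__ == "__main__":
    # Hypothetical toy cohort of ten people followed for up to five years.
    toy_times = [1, 2, 3, 4, 5, 5, 5, 5, 5, 5]
    toy_events = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    print("Five-year incidence:", kaplan_meier_incidence(toy_times, toy_events, 5))
```

Applying the same estimator to the group selected by each model and to the group selected by the guideline criteria gives incidence rates that can be compared directly, because both groups contain the same number of people.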
Interpretation
Most recalibrated models demonstrated comparable and favorable discrimination and calibration, and identified individuals at elevated risk of lung cancer with greater precision than the CNLCS 2023 criteria. Models designed for the general population (such as LLPv3, LLPi, Korean Men, JPHC, and LCRS) are more appropriate for identifying high-risk groups than those designed exclusively for smokers.
Funding
National Natural Science Foundation of China, General Project of Zhejiang Provincial Medical and Health Technology Plan for the Year 2024, Natural Science Foundation of Zhejiang Province.
About the journal:
The Lancet Regional Health – Western Pacific, a gold open access journal, is an integral part of The Lancet's global initiative advocating for healthcare quality and access worldwide. It aims to advance clinical practice and health policy in the Western Pacific region, contributing to enhanced health outcomes. The journal publishes high-quality original research shedding light on clinical practice and health policy in the region. It also includes reviews, commentaries, and opinion pieces covering diverse regional health topics, such as infectious diseases, non-communicable diseases, child and adolescent health, maternal and reproductive health, aging health, mental health, the health workforce and systems, and health policy.