Novel machine learning algorithm in risk prediction model for pan-cancer risk: application in a large prospective cohort

Xifeng Wu, Huakang Tu, Qingfeng Hu, Shan-Pou Tsai, David Ta-Wei Chu, C. Wen
BMJ Oncology. Published 2024-07-01. DOI: 10.1136/bmjonc-2023-000087 (https://doi.org/10.1136/bmjonc-2023-000087)

Abstract

This study aimed to develop and validate machine-learning models that predict the risk of pan-cancer incidence using demographic, questionnaire and routine health check-up data in a large Asian population.

This prospective cohort study included 433 549 participants from the MJ cohort, divided into a male cohort (n=208 599) and a female cohort (n=224 950).

During a median follow-up of 8 years, 5143 cancers occurred in males and 4764 in females. XGBoost outperformed Lasso-Cox and Random Survival Forests in both cohorts. The XGBoost model using all features (155 in males, 160 in females) achieved an area under the curve (AUC) of 0.877 in males and 0.750 in females. Light models with 31 variables for males and 11 variables for females showed comparable performance. In the male cohort, the AUC was 0.876 (95% CI 0.858 to 0.894) in the overall population and 0.818 (95% CI 0.795 to 0.841) in those aged ≥40 years. In the female cohort, the AUC was 0.746 (95% CI 0.721 to 0.771) in the overall population and 0.641 (95% CI 0.605 to 0.677) in those aged ≥40 years. High-risk individuals had at least ninefold higher risk of pan-cancer incidence than those in low-risk groups.

We developed and internally validated the first machine-learning models based on routine health check-up data to predict pan-cancer risk in the general population, achieving generally good discriminatory ability with a small set of predictors. External validation is warranted before our risk model is implemented in clinical practice.
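
The AUC values and 95% CIs reported above summarize how well a risk score separates participants who developed cancer from those who did not. The abstract does not say how the AUC or its intervals were computed, so the following is only a minimal sketch under stated assumptions: a binary outcome (censoring and follow-up time are ignored), a percentile bootstrap for the interval, and made-up toy data. The function names `auc` and `bootstrap_ci` are hypothetical and are not taken from the paper.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method, not the paper's)."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("need at least one case and one non-case")
    rng = random.Random(seed)
    indices = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        # Skip resamples that contain only one class; AUC is undefined there.
        if 0 < sum(ys) < len(ys):
            stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data for illustration only; these are not values from the MJ cohort.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
toy_ci = bootstrap_ci(toy_labels, toy_scores)
```

On the toy data, three of the four case/non-case pairs are ordered correctly, so `toy_auc` is 0.75. For time-to-event data such as this cohort, a time-dependent AUC or concordance index that accounts for censoring would normally be used instead of this binary simplification.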