Building a cancer risk and survival prediction model based on social determinants of health combined with machine learning: A NHANES 1999 to 2018 retrospective cohort study.

IF 1.3 4区 医学 Q2 MEDICINE, GENERAL & INTERNAL
Shiqi Zhang, Jianan Jin, Qi Zheng, Zhenyu Wang
{"title":"Building a cancer risk and survival prediction model based on social determinants of health combined with machine learning: A NHANES 1999 to 2018 retrospective cohort study.","authors":"Shiqi Zhang, Jianan Jin, Qi Zheng, Zhenyu Wang","doi":"10.1097/MD.0000000000041370","DOIUrl":null,"url":null,"abstract":"<p><p>The occurrence and progression of cancer is a significant focus of research worldwide, often accompanied by a prolonged disease course. Concurrently, researchers have identified that social determinants of health (SDOH) (employment status, family income and poverty ratio, food security, education level, access to healthcare services, health insurance, housing conditions, and marital status) are associated with the progression of many chronic diseases. However, there is a paucity of research examining the influence of SDOH on cancer incidence risk and the survival of cancer survivors. The aim of this study was to utilize SDOH as a primary predictive factor, integrated with machine learning models, to forecast both cancer risk and prognostic survival. This research is grounded in the SDOH data derived from the National Health and Nutrition Examination Survey dataset spanning 1999 to 2018. It employs methodologies including adaptive boosting, gradient boosting machine (GradientBoosting), random forest (RF), extreme gradient boosting, light gradient boosting machine, support vector machine, and logistic regression to develop models for predicting cancer risk and prognostic survival. The hyperparameters of these models-specifically, the number of estimators (100-200), maximum tree depth (10), learning rate (0.01-0.2), and regularization parameters-were optimized through grid search and cross-validation, followed by performance evaluation. Shapley Additive exPlanations plots were generated to visualize the influence of each feature. RF was the best model for predicting cancer risk (area under the curve: 0.92, accuracy: 0.84). Age, non-Hispanic White, sex, and housing status were the 4 most important characteristics of the RF model. Age, gender, employment status, and household income/poverty ratio were the 4 most important features in the gradient boosting machine model. The predictive models developed in this study exhibited strong performance in estimating cancer incidence risk and survival time, identifying several factors that significantly influence both cancer incidence risk and survival, thereby providing new evidence for cancer management. Despite the promising findings, this study acknowledges certain limitations, including the omission of risk factors in the cancer survivor survival model and potential biases inherent in the National Health and Nutrition Examination Survey dataset. Future research is warranted to further validate the model using external datasets.</p>","PeriodicalId":18549,"journal":{"name":"Medicine","volume":"104 6","pages":"e41370"},"PeriodicalIF":1.3000,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11813008/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/MD.0000000000041370","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0

Abstract

The occurrence and progression of cancer is a significant focus of research worldwide, often accompanied by a prolonged disease course. Concurrently, researchers have identified that social determinants of health (SDOH) (employment status, family income and poverty ratio, food security, education level, access to healthcare services, health insurance, housing conditions, and marital status) are associated with the progression of many chronic diseases. However, there is a paucity of research examining the influence of SDOH on cancer incidence risk and the survival of cancer survivors. The aim of this study was to utilize SDOH as a primary predictive factor, integrated with machine learning models, to forecast both cancer risk and prognostic survival. This research is grounded in the SDOH data derived from the National Health and Nutrition Examination Survey dataset spanning 1999 to 2018. It employs methodologies including adaptive boosting, gradient boosting machine (GradientBoosting), random forest (RF), extreme gradient boosting, light gradient boosting machine, support vector machine, and logistic regression to develop models for predicting cancer risk and prognostic survival. The hyperparameters of these models-specifically, the number of estimators (100-200), maximum tree depth (10), learning rate (0.01-0.2), and regularization parameters-were optimized through grid search and cross-validation, followed by performance evaluation. Shapley Additive exPlanations plots were generated to visualize the influence of each feature. RF was the best model for predicting cancer risk (area under the curve: 0.92, accuracy: 0.84). Age, non-Hispanic White, sex, and housing status were the 4 most important characteristics of the RF model. Age, gender, employment status, and household income/poverty ratio were the 4 most important features in the gradient boosting machine model. The predictive models developed in this study exhibited strong performance in estimating cancer incidence risk and survival time, identifying several factors that significantly influence both cancer incidence risk and survival, thereby providing new evidence for cancer management. Despite the promising findings, this study acknowledges certain limitations, including the omission of risk factors in the cancer survivor survival model and potential biases inherent in the National Health and Nutrition Examination Survey dataset. Future research is warranted to further validate the model using external datasets.

求助全文
约1分钟内获得全文 求助全文
来源期刊
Medicine
Medicine 医学-医学:内科
CiteScore
2.80
自引率
0.00%
发文量
4342
审稿时长
>12 weeks
期刊介绍: Medicine is now a fully open access journal, providing authors with a distinctive new service offering continuous publication of original research across a broad spectrum of medical scientific disciplines and sub-specialties. As an open access title, Medicine will continue to provide authors with an established, trusted platform for the publication of their work. To ensure the ongoing quality of Medicine’s content, the peer-review process will only accept content that is scientifically, technically and ethically sound, and in compliance with standard reporting guidelines.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信