Chengkun Sun, Erin Mobley, Michael Quillen, Max Parker, Meghan Daly, Rui Wang, Isabela Visintin, Ziad Awad, Jennifer Fishe, Alexander Parker, Thomas George, Jiang Bian, Jie Xu
JMIR Cancer. 2025;11:e64506. Published 2025-06-19. Journal Article. doi: 10.2196/64506 (https://doi.org/10.2196/64506)
Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data: Case Control Study.
Background: Colorectal cancer is now the leading cause of cancer-related deaths among young Americans. Accurate early prediction and a thorough understanding of the risk factors for early-onset colorectal cancer (EOCRC) are vital for effective prevention and treatment, particularly for patients below the recommended screening age.
Objective: This study aims to predict EOCRC in individuals under the screening age of 45 years using machine learning (ML) and structured electronic health record data, and to explore potential risk and protective factors that could support early diagnosis.
Methods: We identified a cohort of patients under the age of 45 years from the OneFlorida+ Clinical Research Consortium. Given the distinct pathology of colon cancer (CC) and rectal cancer (RC), we built separate prediction models for each cancer type using several ML algorithms. We assessed multiple prediction time windows (ie, 0, 1, 3, and 5 y before diagnosis) and used propensity score matching to control for potential confounders, including sex, race, ethnicity, and birth year. We evaluated performance using area under the curve (AUC), sensitivity, specificity, positive predictive value, negative predictive value, and F1-score. Both linear models (ie, logistic regression and support vector machine) and nonlinear models (ie, Extreme Gradient Boosting and random forest) were assessed to allow comparison across classification strategies. In addition, we used Shapley Additive Explanations (SHAP) to interpret the models and identify key risk and protective factors associated with EOCRC.
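The evaluation metrics named in the Methods can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration. AUC is computed here from its rank interpretation, that is, the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Threshold-based metrics; label 1 = case, 0 = control (assumed coding)."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (hypothetical): 2 cases and 4 controls.
    labels = [1, 1, 0, 0, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
    print(classification_metrics(labels, scores))
    print("AUC:", roc_auc(labels, scores))
```

The pairwise AUC above is quadratic in sample size, which is fine for a sketch; a rank-based implementation would be preferable for cohorts of the size reported in this study.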
Results: The final cohort included 1358 CC cases with 6790 matched controls, and 560 RC cases with 2800 matched controls. The RC group had a more balanced sex distribution (2:3 male-to-female) than the CC group (2:5 male-to-female), and both groups showed diverse racial and ethnic representation. Model discrimination was highest at diagnosis and declined with longer prediction windows: AUC scores for CC prediction were 0.811, 0.748, 0.689, and 0.686 at 0, 1, 3, and 5 years before diagnosis, respectively. For RC prediction, AUC scores were 0.829, 0.771, 0.727, and 0.721 across the same time windows. Key predictive features across both cancer types included immune and digestive system disorders, secondary malignancies, and underweight status. In addition, blood diseases emerged as prominent indicators specifically for CC.
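The cohort counts above imply a fixed control-to-case ratio, which a short calculation makes explicit. The abstract does not state the matching ratio; the 1:5 figure below is inferred from the reported counts only, and the function name is hypothetical.

```python
from typing import Tuple

# Case and matched-control counts as reported in the Results.
COHORTS = {"CC": (1358, 6790), "RC": (560, 2800)}


def matching_summary(cases: int, controls: int) -> Tuple[float, float]:
    """Return (controls per case, share of cases in the matched sample)."""
    controls_per_case = controls / cases
    case_share = cases / (cases + controls)
    return controls_per_case, case_share


if __name__ == "__main__":
    for name, (cases, controls) in COHORTS.items():
        per_case, share = matching_summary(cases, controls)
        print(f"{name}: {per_case:.1f} controls per case, "
              f"cases are {share:.1%} of the matched sample")
```

Both cohorts work out to five controls per case, so cases make up about one sixth of each matched sample. Because positive and negative predictive values depend on prevalence, values measured in such a matched sample would not carry over directly to an unmatched population in which EOCRC is far rarer.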
Conclusions: Our findings demonstrate the potential of ML models leveraging electronic health record data to facilitate the early prediction of EOCRC in individuals under 45 years. By uncovering important risk factors and achieving promising predictive performance, this study provides preliminary insights that could inform future efforts toward earlier detection and prevention in younger populations.