Comparison of spatial prediction models from Machine Learning of cholangiocarcinoma incidence in Thailand.

Impact factor 3.6 · Zone 2 (Medicine) · JCR Q1: Public, Environmental & Occupational Health
Oraya Sahat, Supot Kamsa-Ard, Apiradee Lim, Siriporn Kamsa-Ard, Matias Garcia-Constantino, Idongesit Ekerete
BMC Public Health, volume 25(1), article 2137. Published 2025-06-07. Journal Article. DOI: 10.1186/s12889-025-23119-y (https://doi.org/10.1186/s12889-025-23119-y). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12144797/pdf/
Citations: 0

Abstract

Background: Cholangiocarcinoma (CCA) poses a significant public health challenge in Thailand, with notably high incidence rates. This study aimed to compare the performance of spatial prediction models using Machine Learning techniques to analyze the occurrence of CCA across Thailand.

Methods: This retrospective cohort study analyzed CCA cases from four population-based cancer registries in Thailand, diagnosed between January 1, 2012, and December 31, 2021. The study employed Machine Learning models (Linear Regression, Random Forest, Neural Network, and Extreme Gradient Boosting (XGBoost)) to predict Age-Standardized Rates (ASR) of CCA based on spatial variables. Model performance was evaluated using Root Mean Square Error (RMSE) and R² with 70:30 train-test validation.
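The evaluation protocol described in the Methods (a 70:30 train-test split scored with RMSE and R²) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the data are synthetic, the single predictor `elevation`, its assumed linear relation to ASR, and the helper names `split_70_30`, `rmse`, `r_squared`, and `fit_line` are all hypothetical. Only a one-variable least-squares baseline is fitted here, whereas the study itself also compared Random Forest, Neural Network, and XGBoost models.

```python
import math
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) in a 70:30 ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (elevation in metres, ASR per 100,000) pairs.
# These numbers are made up for illustration and are NOT study data.
rng = random.Random(42)
rows = []
for _ in range(200):
    elevation = rng.uniform(0, 1000)
    asr = 15.0 - 0.01 * elevation + rng.gauss(0, 1.0)
    rows.append((elevation, asr))

train, test = split_70_30(rows)
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# Score the fitted line on both partitions, as the abstract reports
# separate training and testing R² values.
scores = {}
for name, part in (("train", train), ("test", test)):
    y_true = [y for _, y in part]
    y_pred = [slope * x + intercept for x, _ in part]
    scores[name] = (rmse(y_true, y_pred), r_squared(y_true, y_pred))
    print(f"{name}: RMSE={scores[name][0]:.3f}  R2={scores[name][1]:.3f}")
```

Comparing the training and testing scores side by side is what lets a reader judge overfitting: a model whose test R² stays close to its training R² is generalizing rather than memorizing.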

Results: The study included 6,379 CCA cases, with a male predominance (4,075 cases; 63.9%) and a mean age of 66.2 years (standard deviation = 11.1 years). The northeastern region accounted for most of the cases (3,898 cases; 61.1%). The overall ASR of CCA was 8.9 per 100,000 person-years (95% CI: 8.7 to 9.2), with the northeastern region showing the highest incidence (ASR = 13.4 per 100,000 person-years; 95% CI: 12.9 to 13.8). In the overall dataset, the Random Forest model demonstrated better prediction performance than the other models in both the training (R² = 72.07%) and testing datasets (R² = 71.66%). Regional variations in model performance were observed, with Random Forest performing best in the northern and northeastern regions, while XGBoost excelled in the central and southern regions. The most important spatial predictors for CCA were elevation and distance from water sources.
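For readers unfamiliar with age-standardized rates, the sketch below shows one common way to compute an ASR, direct standardization, in which each age band's crude rate is weighted by a standard population. The abstract does not state which standardization method or standard population the authors used, so the function name `age_standardized_rate`, the three age bands, the case counts, the person-years, and the weights are all hypothetical. The last lines re-derive the two percentages reported in the Results from the stated case counts.

```python
def age_standardized_rate(cases, person_years, weights, per=100_000):
    """Directly standardized rate: weighted mean of age-specific rates.

    cases[i] / person_years[i] is the crude rate in age band i, and
    weights[i] is that band's share of the standard population.
    """
    if not (len(cases) == len(person_years) == len(weights)):
        raise ValueError("all inputs must have one entry per age band")
    total_weight = sum(weights)
    weighted = sum(
        (c / py) * w for c, py, w in zip(cases, person_years, weights)
    )
    return per * weighted / total_weight


# Hypothetical inputs for three age bands (NOT study data).
cases = [2, 10, 30]
person_years = [100_000, 50_000, 20_000]
weights = [0.5, 0.3, 0.2]  # assumed standard-population shares

asr = age_standardized_rate(cases, person_years, weights)
print(f"ASR = {asr:.1f} per 100,000 person-years")

# Consistency check on figures quoted in the abstract.
total_cases = 6379
male_share = round(100 * 4075 / total_cases, 1)          # 63.9
northeastern_share = round(100 * 3898 / total_cases, 1)  # 61.1
print(male_share, northeastern_share)
```

Standardizing by age matters here because CCA cases are concentrated in older adults (mean age 66.2 years), so regions with older populations would otherwise show inflated crude rates.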

Conclusion: The Random Forest model demonstrated the highest efficiency in predicting CCA incidence rates in Thailand, though predictive performance varied across regions. Spatial factors effectively predicted ASR of CCA, providing valuable insights for national-level disease surveillance and targeted public health interventions. These findings support the development of region-specific approaches for CCA control using spatial epidemiology and machine learning techniques.

Source journal: BMC Public Health (Medicine: Public, Environmental & Occupational Health)
CiteScore: 6.50
Self-citation rate: 4.40%
Articles published: 2108
Review time: 1 month
About the journal: BMC Public Health is an open access, peer-reviewed journal that considers articles on the epidemiology of disease and the understanding of all aspects of public health. The journal has a special focus on the social determinants of health, the environmental, behavioral, and occupational correlates of health and disease, and the impact of health policies, practices and interventions on the community.