A machine learning framework for predicting healthcare utilization and risk factors

Healthcare analytics (New York, N.Y.) | Pub Date: 2025-08-19 | DOI: 10.1016/j.health.2025.100411

Yead Rahman, Prerna Dua

Volume 8, Article 100411. Publication type: Journal Article. Source: https://www.sciencedirect.com/science/article/pii/S2772442525000309

Citations: 0

Abstract



Medicaid data, with its vast scale and heterogeneity, presents significant challenges in predictive modeling and healthcare analytics. This study analyzes over 6.3 million records from the Louisiana Department of Health (LDH) to identify the most effective machine learning models for predicting clinical service utilization, COVID-19 infections, and tobacco use. A rigorous preprocessing pipeline ensured data integrity, while exploratory data analysis (EDA) guided feature selection, ultimately retaining 20 key variables to capture complex interactions. Seven supervised models, i.e., logistic regression, extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), random forest, decision tree, artificial neural networks (ANN), and naïve Bayes, were evaluated based on predictive performance, computational efficiency, and feature importance. While ensemble methods such as XGBoost and random forest achieved superior accuracy, their high computational demands highlight the trade-off between performance and efficiency in large-scale healthcare analytics. Simpler models like naïve Bayes and decision trees were computationally efficient but less accurate. Key predictors included hospital stay duration for healthcare service utilization, tobacco use for COVID-19 risk, and chronic obstructive pulmonary disease (COPD) for tobacco use. These findings emphasize the impact of comorbidities and demographics on healthcare utilization, offering data-driven insights for healthcare practitioners and policymakers to enhance patient care, optimize costs, and refine policy decisions.
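The evaluation idea in the abstract, scoring each model on both predictive performance and computational cost, can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, and the two classifiers (a Gaussian naïve Bayes and a one-split decision stump, both written with the Python standard library only) are small stand-ins for the seven models named in the study. All names here (`make_data`, `GaussianNB`, `DecisionStump`, `evaluate`) are hypothetical.

```python
import math
import random
import time


def make_data(n, seed=0):
    """Synthetic binary data (an assumption, not the LDH records)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        # Feature 0 is informative (class 1 has a higher mean); feature 1 is noise.
        rows.append(([rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0)], y))
    return rows


class GaussianNB:
    """Gaussian naive Bayes with per-class feature means and variances."""

    def fit(self, rows):
        self.stats = {}
        for label in (0, 1):
            xs = [x for x, y in rows if y == label]
            params = []
            for j in range(len(xs[0])):
                col = [x[j] for x in xs]
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
                params.append((mean, var))
            self.stats[label] = (len(xs) / len(rows), params)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, params = self.stats[label]
            total = math.log(prior)
            for v, (mean, var) in zip(x, params):
                total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            return total

        return max((0, 1), key=log_posterior)


class DecisionStump:
    """Depth-1 decision tree: the single best threshold on one feature."""

    def fit(self, rows):
        best = (-1, None, None, None)  # (correct count, feature, threshold, polarity)
        for j in range(len(rows[0][0])):
            for t in sorted({round(x[j], 1) for x, _ in rows}):
                correct = sum((x[j] > t) == (y == 1) for x, y in rows)
                # polarity 1 predicts class 1 above the threshold; polarity 0 below it.
                for polarity, score in ((1, correct), (0, len(rows) - correct)):
                    if score > best[0]:
                        best = (score, j, t, polarity)
        _, self.feature, self.threshold, self.polarity = best
        return self

    def predict(self, x):
        above = x[self.feature] > self.threshold
        return int(above) if self.polarity == 1 else int(not above)


def evaluate(model, train, test):
    """Return held-out accuracy and wall-clock training time for one model."""
    start = time.perf_counter()
    model.fit(train)
    fit_seconds = time.perf_counter() - start
    hits = sum(model.predict(x) == y for x, y in test)
    return {"accuracy": hits / len(test), "fit_seconds": fit_seconds}


data = make_data(3000)
train, test = data[:2400], data[2400:]
results = {
    "naive_bayes": evaluate(GaussianNB(), train, test),
    "decision_stump": evaluate(DecisionStump(), train, test),
}
for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}, fit={r['fit_seconds']:.4f}s")
```

Reporting accuracy and fit time side by side for every model is what makes a performance-versus-efficiency trade-off visible; in practice the same `evaluate` pattern would wrap library implementations and real, preprocessed records rather than these toy classes.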


Source journal

Healthcare analytics (New York, N.Y.). Subject areas: Applied Mathematics, Modelling and Simulation, Nursing and Health Professions (General)

CiteScore: 4.40

Self-citation rate: 0.00%

Review time: 79 days