A machine learning framework for predicting healthcare utilization and risk factors
Authors: Yead Rahman, Prerna Dua
Journal: Healthcare analytics (New York, N.Y.), volume 8, Article 100411
Published: 2025-08-19
DOI: 10.1016/j.health.2025.100411
URL: https://www.sciencedirect.com/science/article/pii/S2772442525000309
Citation count: 0
Abstract
Medicaid data, with its vast scale and heterogeneity, presents significant challenges for predictive modeling and healthcare analytics. This study analyzes over 6.3 million records from the Louisiana Department of Health (LDH) to identify the most effective machine learning models for predicting clinical service utilization, COVID-19 infection, and tobacco use. A rigorous preprocessing pipeline ensured data integrity, while exploratory data analysis (EDA) guided feature selection, ultimately retaining 20 key variables to capture complex interactions. Seven supervised models, namely logistic regression, extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), random forest, decision tree, artificial neural network (ANN), and naïve Bayes, were evaluated on predictive performance, computational efficiency, and feature importance. Ensemble methods such as XGBoost and random forest achieved superior accuracy, but their high computational demands highlight the trade-off between performance and efficiency in large-scale healthcare analytics. Simpler models such as naïve Bayes and decision trees were computationally efficient but less accurate. Key predictors included hospital stay duration for healthcare service utilization, tobacco use for COVID-19 risk, and chronic obstructive pulmonary disease (COPD) for tobacco use. These findings emphasize the impact of comorbidities and demographics on healthcare utilization, offering data-driven insights that healthcare practitioners and policymakers can use to enhance patient care, optimize costs, and refine policy decisions.
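The evaluation described in the abstract, in which each model is scored on both predictive performance and computational cost, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' pipeline: the data are synthetic, the two features (`age`, `stay`) are hypothetical, and the two toy classifiers (`MajorityClass`, `DecisionStump`) merely stand in for the seven models the study compares.

```python
import random
import time


class MajorityClass:
    """Baseline stand-in: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class DecisionStump:
    """One-split decision tree: thresholds a single feature."""

    def fit(self, X, y):
        best = (-1.0, 0, 0.0)  # (training accuracy, feature index, threshold)
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                hits = sum((row[j] > t) == bool(label) for row, label in zip(X, y))
                acc = hits / len(y)
                if acc > best[0]:
                    best = (acc, j, t)
        _, self.feature_, self.threshold_ = best
        return self

    def predict(self, X):
        return [int(row[self.feature_] > self.threshold_) for row in X]


def make_synthetic(n, seed=0):
    """Synthetic records (NOT the LDH data): label depends on length of stay."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        stay = rng.randint(0, 14)   # hypothetical length of stay, in days
        age = rng.randint(18, 90)   # hypothetical age, in years
        label = int(stay > 5)
        if rng.random() < 0.1:      # flip about 10% of labels as noise
            label = 1 - label
        X.append([age, stay])
        y.append(label)
    return X, y


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return held-out accuracy and wall-clock fit time for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    preds = model.predict(X_test)
    accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
    return {"accuracy": accuracy, "fit_seconds": fit_seconds}


X, y = make_synthetic(600)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]
results = {
    type(m).__name__: evaluate(m, X_train, y_train, X_test, y_test)
    for m in (MajorityClass(), DecisionStump())
}
for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}, fit={r['fit_seconds'] * 1000:.2f} ms")
```

Reporting accuracy and fit time side by side is what makes the trade-off visible: in this toy setting the stump should be the more accurate of the two but also the slower to fit, which loosely mirrors the accuracy-versus-cost pattern the abstract reports for ensemble methods against simpler models.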