{"title":"A machine learning framework for predicting healthcare utilization and risk factors","authors":"Yead Rahman, Prerna Dua","doi":"10.1016/j.health.2025.100411","journal":{"name":"Healthcare analytics (New York, N.Y.)","volume":"8","pages":"Article 100411"},"publicationDate":"2025-08-19","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"0","platform":"Semanticscholar","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772442525000309"}
A machine learning framework for predicting healthcare utilization and risk factors
Medicaid data, with its vast scale and heterogeneity, presents significant challenges for predictive modeling and healthcare analytics. This study analyzes over 6.3 million records from the Louisiana Department of Health (LDH) to identify the most effective machine learning models for predicting clinical service utilization, COVID-19 infections, and tobacco use. A rigorous preprocessing pipeline ensured data integrity, while exploratory data analysis (EDA) guided feature selection, ultimately retaining 20 key variables to capture complex interactions. Seven supervised models, namely logistic regression, extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), random forest, decision tree, artificial neural network (ANN), and naïve Bayes, were evaluated on predictive performance, computational efficiency, and feature importance. Ensemble methods such as XGBoost and random forest achieved superior accuracy, but their high computational demands highlight the trade-off between performance and efficiency in large-scale healthcare analytics. Simpler models such as naïve Bayes and decision trees were computationally efficient but less accurate. Key predictors included hospital stay duration for healthcare service utilization, tobacco use for COVID-19 risk, and chronic obstructive pulmonary disease (COPD) for tobacco use. These findings emphasize the impact of comorbidities and demographics on healthcare utilization, offering data-driven insights for healthcare practitioners and policymakers to enhance patient care, optimize costs, and refine policy decisions.
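
The abstract describes comparing models on both predictive performance and computational efficiency but gives no implementation details. The following is a minimal sketch of such a comparison, not the authors' pipeline. It assumes synthetic two-feature data in place of the LDH records and uses two pure-Python stand-in models (a majority-class baseline and a Gaussian naïve Bayes) rather than the seven models studied. The names `benchmark`, `make_synthetic`, `MajorityBaseline`, and `GaussianNaiveBayes` are hypothetical and introduced only for illustration.

```python
import math
import random
import time
from collections import Counter, defaultdict


class MajorityBaseline:
    """Stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class GaussianNaiveBayes:
    """Stand-in model: pure-Python Gaussian naive Bayes for numeric features."""

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.params_ = {}
        for label, rows in rows_by_class.items():
            n = len(rows)
            columns = list(zip(*rows))
            means = [sum(col) / n for col in columns]
            # A small variance floor avoids division by zero.
            variances = [
                sum((v - m) ** 2 for v in col) / n + 1e-9
                for col, m in zip(columns, means)
            ]
            self.params_[label] = (math.log(n / len(y)), means, variances)
        return self

    def _log_posterior(self, row, label):
        log_prior, means, variances = self.params_[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )

    def predict(self, X):
        return [
            max(self.params_, key=lambda c: self._log_posterior(row, c))
            for row in X
        ]


def make_synthetic(n, seed=0):
    """Assumed toy data: about 30% positives, with features shifted by +2."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)])
        y.append(label)
    return X, y


def benchmark(models, X_train, y_train, X_test, y_test):
    """Fit each model, record test accuracy and fit time, best accuracy first."""
    results = []
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - start
        predictions = model.predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results.append(
            {"model": name, "accuracy": correct / len(y_test), "fit_seconds": fit_seconds}
        )
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)


if __name__ == "__main__":
    X_train, y_train = make_synthetic(2000, seed=1)
    X_test, y_test = make_synthetic(500, seed=2)
    models = {"baseline": MajorityBaseline(), "naive_bayes": GaussianNaiveBayes()}
    for row in benchmark(models, X_train, y_train, X_test, y_test):
        print(row)
```

Because `benchmark` only calls `fit` and `predict`, any estimator exposing those two methods, as scikit-learn classifiers do, could be passed in the `models` dictionary to extend the comparison.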