Predictive interpretable analytics models for forecasting healthcare costs using open healthcare data

Healthcare analytics (New York, N.Y.) Pub Date : 2024-06-26 DOI:10.1016/j.health.2024.100351

A. Ravishankar Rao , Raunak Jain , Mrityunjai Singh , Rahul Garg


Citations: 0

Abstract

Healthcare expenditure, a considerable proportion of national budgets, has risen rapidly. Consequently, substantial research is devoted to controlling healthcare costs. Many efforts are underway to improve medical price transparency. Price transparency will help patients become better informed, allowing them to shop for care they can afford, eventually leading to efficiency in healthcare markets. This first requires medical pricing data to be made publicly available. Since the raw pricing data can be large and cover multiple conditions, it is necessary to provide an engine that processes the data to facilitate its usage and understanding. We recommend creating computational models that predict healthcare costs for various patient conditions and demographics. Patients and providers can interrogate the underlying data to understand how healthcare costs vary with medical conditions and demographic variables of interest, including age. We demonstrate our approach by creating predictive models using recent machine learning techniques. We analyzed anonymous patient data from the New York State Statewide Planning and Research Cooperative System, consisting of 2.34 million records from 2019. We built models to predict costs from over two dozen patient variables, including diagnosis codes, severity of illness, age, and other demographic variables. We investigated three models: regression, decision trees, and random forests. These models are explainable. We analyzed features to determine those that were predictive of total costs. We determined that the diagnosis code, severity of illness, and length of stay were good predictors of total costs, whereas race and gender were not useful in predicting total costs. We obtained the best performance using a CatBoost regressor, which yielded an R² score of 0.85, better than the values reported in the literature.
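The feature-screening idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in a one-variable least-squares fit per feature instead of decision trees, random forests, or CatBoost, and it runs on synthetic records rather than the New York State data. The field names (`length_of_stay`, `severity`, `gender`, `total_cost`) and the cost formula are assumptions made up for the example. It only shows how an R² score can separate predictive features from non-predictive ones.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_synthetic_records(n, seed=0):
    """Generate hypothetical discharge records (NOT real patient data)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        los = rng.randint(1, 20)      # length of stay in days
        severity = rng.randint(1, 4)  # severity-of-illness level
        gender = rng.randint(0, 1)    # has no effect on cost by construction
        # Assumed cost formula, chosen only for illustration.
        cost = 1500.0 * los + 4000.0 * severity + rng.gauss(0.0, 3000.0)
        records.append({
            "length_of_stay": los,
            "severity": severity,
            "gender": gender,
            "total_cost": cost,
        })
    return records


def univariate_r2(records, feature, target="total_cost"):
    """In-sample R^2 of a one-variable linear fit of target on feature."""
    x = [r[feature] for r in records]
    y = [r[target] for r in records]
    slope, intercept = fit_line(x, y)
    predictions = [slope * xi + intercept for xi in x]
    return r2_score(y, predictions)


records = make_synthetic_records(2000)
scores = {
    name: univariate_r2(records, name)
    for name in ("length_of_stay", "severity", "gender")
}
# Rank features from most to least predictive of total cost.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>15}: R^2 = {score:.3f}")
```

In this synthetic setup, length of stay ranks highest and gender scores near zero because the data were generated that way. On real data, a tree-based model with its built-in feature importances would also capture interactions and non-linear effects that this per-feature screen misses.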


Source journal

Healthcare analytics (New York, N.Y.) Applied Mathematics, Modelling and Simulation, Nursing and Health Professions (General)

CiteScore

4.40

Self-citation rate

0.00%

Articles published

Review time

79 days