Generalization of Deep Neural Network of Hospital Readmission Prediction Models for Diabetes Patients Using Apache Spark Clustering

2020 International Conference on Assistive and Rehabilitation Technologies (iCareTech). Pub Date: 2020-08-01. DOI: 10.1109/iCareTech49914.2020.00030

Fatma Al-Rubaei, M. Alhanjouri


Citations: 0


Generalization of Deep Neural Network of Hospital Readmission Prediction Models for Diabetes Patients Using Apache Spark Clustering

A high readmission rate, meaning the percentage of patients who are admitted to hospital again within a specific period after discharge, is a major concern for many healthcare organizations. Reducing it would significantly improve healthcare services: it would lower the pressure on reception, focus resources on cases that need special care (which could help save lives), cut costs, and give patients a better quality of life. One way to reduce the readmission rate is to study how it relates to patient factors such as age, race, and number of diagnoses. Many studies have examined this relationship using machine learning techniques, both unsupervised (clustering algorithms such as k-means) and supervised (regression and classification algorithms such as KNN, decision trees, and neural networks).

This research uses a dataset of diabetic patients gathered from 130 US hospitals over the years 1999–2008, with more than 100,000 instances and 55 attributes. We study the relationship between the readmission rate and other patient factors to determine which factors most strongly lead to a higher readmission rate and which help most to reduce it. Our novel solution consists of two stages. In the first stage, we develop models that predict readmission and then use those models to identify the critical risk factors. Experimenting with different machine learning models, we report accuracies of 53%, 35.7%, 35%, 11.6%, and 50% on a held-out test set for KNN, decision trees, random forest, support vector machine, and deep neural networks, respectively, and we use an ablation study to identify the top ten influential risk factors.

In the second stage, we reduce the computational time for training the machine learning models by using the Spark cluster-computing framework to distribute the dataset across 4 workers. We report a 32% reduction in training time compared with running the models without Spark.
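
The abstract does not say how the ablation study was carried out. One common form is leave-one-feature-out ablation: retrain the model with a single feature removed and rank features by how much held-out accuracy falls. The sketch below illustrates that idea only. It is not the authors' code: the pure-Python KNN, the synthetic data, and the feature names in `FEATURES` are all assumptions made for illustration, and nothing here uses the diabetes dataset.

```python
import random

# Hypothetical feature names for the synthetic data (not from the paper).
FEATURES = ["informative", "noise_1", "noise_2"]


def make_synthetic(n, seed):
    """Build n rows where only the first feature determines the label."""
    rng = random.Random(seed)
    rows = [[rng.random(), rng.random(), rng.random()] for _ in range(n)]
    labels = [1 if row[0] > 0.5 else 0 for row in rows]
    return rows, labels


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def accuracy(train_X, train_y, test_X, test_y, k=5):
    """Fraction of held-out rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def drop_column(rows, j):
    """Return a copy of rows with feature column j removed."""
    return [row[:j] + row[j + 1:] for row in rows]


def ablation_ranking(train_X, train_y, test_X, test_y, names, k=5):
    """Rank features by the accuracy lost when each one is removed in turn."""
    baseline = accuracy(train_X, train_y, test_X, test_y, k)
    drops = []
    for j, name in enumerate(names):
        ablated = accuracy(
            drop_column(train_X, j), train_y, drop_column(test_X, j), test_y, k
        )
        drops.append((name, baseline - ablated))
    # Largest accuracy drop first: those features matter most to the model.
    drops.sort(key=lambda item: item[1], reverse=True)
    return baseline, drops


if __name__ == "__main__":
    train_X, train_y = make_synthetic(200, seed=0)
    test_X, test_y = make_synthetic(100, seed=1)
    baseline, ranking = ablation_ranking(train_X, train_y, test_X, test_y, FEATURES)
    print(f"baseline accuracy: {baseline:.3f}")
    for name, drop in ranking:
        print(f"{name}: accuracy drop {drop:+.3f}")
```

Taking the first ten entries of such a ranking would give a "top ten" list of the kind the abstract reports. A negative drop means the model did slightly better without that feature, which is expected for pure-noise columns.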
