Generalization of Deep Neural Network of Hospital Readmission Prediction Models for Diabetes Patients Using Apache Spark Clustering
Fatma Al-Rubaei, M. Alhanjouri
2020 International Conference on Assistive and Rehabilitation Technologies (iCareTech), published 2020-08-01
DOI: 10.1109/iCareTech49914.2020.00030
A high readmission rate, defined as the percentage of patients who are readmitted to the hospital within a specific period after discharge, is a major concern for many healthcare organizations. Reducing it would significantly improve healthcare services: it lowers the pressure on reception, focuses resources on cases that need special care (which could help save lives), cuts costs, and improves patients' quality of life. One way to reduce the readmission rate is to study how it relates to patient factors such as age, race, and number of diagnoses. Many studies have examined this relationship using machine learning, both unsupervised learning, which includes clustering algorithms such as k-means, and supervised learning, which includes regression and classification algorithms such as KNN, decision trees, and neural networks. This research uses a dataset of diabetic patients gathered from 130 US hospitals over the years 1999–2008, with more than 100,000 instances and 55 attributes, to determine which factors most strongly lead to a higher readmission rate and which most help to reduce it. Our novel solution consists of two stages. In the first stage, we develop predictive models for readmission and then use those models to identify the critical risk factors. Experimenting with different machine learning models, we report accuracies of 53%, 35.7%, 35%, 11.6%, and 50% on a held-out test set for KNN, decision tree, random forest, support vector, and deep neural network models, respectively, and we identify the top ten influential risk factors using an ablation study.
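The abstract does not spell out how the ablation study was run. As a rough illustration only, the sketch below shows one common form, leave-one-feature-out ablation: each feature is dropped in turn and ranked by how much held-out accuracy falls. The 1-nearest-neighbour classifier, the feature names `signal` and `noise`, and the toy data are all hypothetical stand-ins, not the authors' models or the hospital dataset.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def predict_1nn(train_X: List[Row], train_y: List[int], x: Row) -> int:
    """Label of the training row closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def accuracy(train_X, train_y, test_X, test_y, keep: List[int]) -> float:
    """Held-out accuracy using only the feature columns listed in `keep`."""
    def project(row: Row) -> List[float]:
        return [row[j] for j in keep]

    projected_train = [project(r) for r in train_X]
    hits = sum(
        predict_1nn(projected_train, train_y, project(r)) == label
        for r, label in zip(test_X, test_y)
    )
    return hits / len(test_y)


def ablation_ranking(names, train_X, train_y, test_X, test_y) -> List[Tuple[str, float]]:
    """Rank features by the accuracy lost when each one is removed."""
    all_cols = list(range(len(names)))
    baseline = accuracy(train_X, train_y, test_X, test_y, all_cols)
    drops = []
    for j, name in enumerate(names):
        kept = [c for c in all_cols if c != j]
        ablated = accuracy(train_X, train_y, test_X, test_y, kept)
        drops.append((name, baseline - ablated))
    return sorted(drops, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
names = ["signal", "noise"]
train_X = [(0.0, 0.3), (0.1, 0.9), (0.2, 0.5), (1.0, 0.4), (0.9, 0.8), (0.8, 0.1)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.05, 0.8), (0.15, 0.15), (0.95, 0.3), (0.85, 0.7)]
test_y = [0, 0, 1, 1]

ranking = ablation_ranking(names, train_X, train_y, test_X, test_y)
print(ranking)
```

On a real dataset the same loop would run over all attributes, and the ten features with the largest accuracy drop would be reported as the most influential, assuming the ablation was defined this way.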
In the second stage, we reduced the computational time for training the machine learning models by using the Spark cluster-computing framework to distribute the dataset across 4 workers, and we report a 32% reduction in training time compared with running the models without Spark.
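To put the reported time reduction in more familiar terms, the short calculation below converts it into a speedup factor and a parallel efficiency. It assumes that "32% time reduction" means the Spark run took 68% of the non-Spark time, and that the non-Spark baseline ran on a single worker; neither detail is stated in the abstract, and the function names are illustrative.

```python
def speedup_from_reduction(time_reduction: float) -> float:
    """Speedup T_baseline / T_parallel implied by a fractional time reduction."""
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1.0 / (1.0 - time_reduction)


def parallel_efficiency(speedup: float, workers: int) -> float:
    """Speedup divided by worker count; 1.0 would be ideal linear scaling."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures from the abstract: 32% less training time on 4 Spark workers.
speedup = speedup_from_reduction(0.32)
efficiency = parallel_efficiency(speedup, 4)
print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.0%}")
```

Under these assumptions the 32% reduction corresponds to roughly a 1.47x speedup, or about 37% efficiency across 4 workers, which suggests that distribution overhead or non-parallel steps account for much of the runtime.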