Performance Assessment of Machine Learning Based Models for Diabetes Prediction
R. Deo, S. Panigrahi
2019 IEEE Healthcare Innovations and Point of Care Technologies (HI-POCT), published 2019-11-01
DOI: 10.1109/HI-POCT45284.2019.8962811 (https://doi.org/10.1109/HI-POCT45284.2019.8962811)
Citation count: 6
Abstract
Diabetes is a major chronic disease that affects all age groups, and its prevalence is increasing worldwide. Certain factors increase an individual's chance of developing diabetes. Prediction-based modeling has previously been used to support a prevention-based approach to diabetes, and prediction models have predominantly been based on regression and feature elimination. In this paper, a machine learning-based approach is presented to predict individual diabetes occurrence from specific lifestyle and demographic factors. A publicly available dataset, continuous NHANES, was used. To account for the small data size caused by missing data and for class imbalance, certain statistical techniques were applied. The synthetic minority over-sampling technique (SMOTE) was applied using Gower’s distance to address class imbalance. Additionally, principal component analysis was used as a feature extraction technique. Predictive models were developed using MATLAB. A dataset with 140 samples and 11 predictor variables (converted to eight principal components) was used. The output variable had two classes: diabetic and not diabetic. The data were split into 98 samples for training and 42 samples for testing. Two machine learning models, bagged trees and linear SVM, were developed. Two validation techniques, 5-fold cross-validation and holdout validation, were assessed. The highest accuracy of 91% (90.82% on test data) was obtained by the linear SVM model under both 5-fold cross-validation and holdout validation (AUC of 0.908 in both cases).
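The abstract mentions Gower's distance as the neighbor metric for oversampling but does not show how it is computed. The following is a minimal Python sketch of the standard equal-weight form of Gower's distance for records that mix numeric and categorical features, plus a nearest-neighbor lookup of the kind an oversampling step would need. It is not the authors' MATLAB implementation. The feature names (age, BMI, smoker), the sample values, and the feature ranges are hypothetical and are not taken from NHANES or from the paper.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   ranges: Sequence[Optional[float]]) -> float:
    """Gower's distance between two records with mixed feature types.

    ranges[k] is (max - min) of numeric feature k over the dataset,
    or None when feature k is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical feature: 0 on a match, 1 on a mismatch.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric feature: absolute difference scaled by the range.
            total += abs(x - y) / r
        # A constant numeric feature (r == 0) contributes nothing.
    return total / len(ranges)


def nearest_neighbor(query_index: int, records: Sequence[Sequence[Value]],
                     ranges: Sequence[Optional[float]]) -> int:
    """Index of the record closest to records[query_index], excluding itself."""
    candidates = [i for i in range(len(records)) if i != query_index]
    return min(
        candidates,
        key=lambda i: gower_distance(records[query_index], records[i], ranges),
    )


# Hypothetical minority-class records: (age in years, BMI, smoker).
minority = [
    (30.0, 22.0, "no"),
    (50.0, 27.0, "yes"),
    (34.0, 23.0, "no"),
]
ranges = [40.0, 20.0, None]  # assumed dataset-wide ranges for age and BMI

d = gower_distance(minority[0], minority[1], ranges)
neighbor = nearest_neighbor(0, minority, ranges)
print(round(d, 4), neighbor)
```

In SMOTE-style oversampling, a synthetic record is typically generated between a minority-class record and one of its nearest minority-class neighbors. The abstract does not state how categorical features were handled during that step, so the sketch stops at the neighbor lookup.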