{"title":"The impact of oversampling with “ubSMOTE” on the performance of machine learning classifiers in prediction of catastrophic health expenditures","authors":"Songul Cinaroglu","doi":"10.1016/j.orhc.2020.100275","DOIUrl":null,"url":null,"abstract":"<div><p>As a common problem in classification tasks, class imbalance degrades the performance of the classifier. Catastrophic out-of-pocket (OOP) health expenditure is a specific example of a rare event faced by very few households. The objective of the present study is to demonstrate a two-step learning approach for modeling highly unbalanced catastrophic OOP health expenditure data. The data are retrieved from the nationally representative Household Budget Survey collected in 2012 by the Turkish Statistical Institute. In total, 9987 households returned valid survey responses. The predictive models are based on eight common risk factors of catastrophic OOP health expenditure. The minority class in the training dataset is oversampled by using a synthetic minority oversampling technique (SMOTE) function, and the original and balanced oversampled training datasets are used to establish the classification models. Logistic regression (LR), random forest (RF) (100 trees), support vector machine (SVM), and neural network (NN) are determined as classifiers. The weighted percentage of households faced with catastrophic OOP health expenditure is 0.14. Balanced oversampling increases the area under the receiver operating characteristic (ROC) curve of LR, RF, SVM, and NN by 0.08%, 0.62%, 0.20%, and 0.23%, respectively. The ROC curve shows NN and RF to be the best classifiers for a balanced oversampled dataset. Identifying a classifier to model highly imbalanced catastrophic OOP health expenditure requires the two-stage procedure of (i) considering a balance between classes and (ii) comparing alternative classifiers. NN and RF are good classifiers in a prediction task with imbalanced catastrophic OOP health expenditure data.</p></div>","PeriodicalId":46320,"journal":{"name":"Operations Research for Health Care","volume":"27 ","pages":"Article 100275"},"PeriodicalIF":1.5000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1016/j.orhc.2020.100275","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Operations Research for Health Care","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2211692320300552","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 1
Abstract
As a common problem in classification tasks, class imbalance degrades the performance of a classifier. Catastrophic out-of-pocket (OOP) health expenditure is a specific example of a rare event faced by very few households. The objective of the present study is to demonstrate a two-step learning approach for modeling highly imbalanced catastrophic OOP health expenditure data. The data are retrieved from the nationally representative Household Budget Survey collected in 2012 by the Turkish Statistical Institute. In total, 9987 households returned valid survey responses. The predictive models are based on eight common risk factors of catastrophic OOP health expenditure. The minority class in the training dataset is oversampled using a synthetic minority oversampling technique (SMOTE) function, and the original and balanced oversampled training datasets are used to build the classification models. Logistic regression (LR), random forest (RF) (100 trees), support vector machine (SVM), and neural network (NN) are used as classifiers. The weighted percentage of households facing catastrophic OOP health expenditure is 0.14. Balanced oversampling increases the area under the receiver operating characteristic (ROC) curve of LR, RF, SVM, and NN by 0.08%, 0.62%, 0.20%, and 0.23%, respectively. The ROC curves show NN and RF to be the best classifiers on the balanced oversampled dataset. Identifying a classifier to model highly imbalanced catastrophic OOP health expenditure therefore requires a two-stage procedure: (i) balancing the classes and (ii) comparing alternative classifiers. NN and RF are good classifiers for prediction tasks with imbalanced catastrophic OOP health expenditure data.
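To make the two-step procedure concrete, the sketch below reproduces its shape in Python: oversample the minority class in the training split only, then fit LR, RF (100 trees), SVM, and NN on both the original and the balanced training data and compare test-set ROC AUC. The study itself uses the ubSMOTE function from R's "unbalanced" package; here imbalanced-learn's SMOTE stands in for it, and the data, class counts, and model settings are illustrative placeholders rather than the study's.

```python
# A minimal sketch (not the authors' code) of the two-step procedure:
# (i) oversample the minority class in the training data, (ii) compare
# classifiers trained on the original vs. balanced training sets by ROC AUC.
# imbalanced-learn's SMOTE is used as a stand-in for R's ubSMOTE; all data
# and parameter choices below are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic placeholder: 9987 households, eight risk factors, rare positive
# class (household faces catastrophic OOP health expenditure).
n_households, n_features, n_positive = 9987, 8, 140  # illustrative counts
X = rng.normal(size=(n_households, n_features))
y = np.zeros(n_households, dtype=int)
y[:n_positive] = 1
rng.shuffle(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: balance the classes by oversampling the minority class,
# in the training split only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Step 2: fit the four classifiers on the original and the balanced
# training data and compare ROC AUC on the untouched test set.
classifiers = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "NN": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
}
for name, clf in classifiers.items():
    results = {}
    for label, (X_fit, y_fit) in {
        "original": (X_train, y_train),
        "oversampled": (X_bal, y_bal),
    }.items():
        clf.fit(X_fit, y_fit)
        scores = clf.predict_proba(X_test)[:, 1]
        results[label] = roc_auc_score(y_test, scores)
    print(f"{name}: AUC original={results['original']:.3f}, "
          f"oversampled={results['oversampled']:.3f}")
```

Oversampling is applied only after the train/test split so that synthetic minority examples never leak into the evaluation set; this mirrors the two-stage logic of first balancing the classes and then comparing the candidate classifiers on held-out data.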