{"title":"The impact of oversampling with “ubSMOTE” on the performance of machine learning classifiers in prediction of catastrophic health expenditures","authors":"Songul Cinaroglu","doi":"10.1016/j.orhc.2020.100275","DOIUrl":null,"url":null,"abstract":"<div><p>As a common problem in classification tasks, class imbalance degrades the performance of the classifier. Catastrophic out-of-pocket (OOP) health expenditure is a specific example of a rare event faced by very few households. The objective of the present study is to demonstrate a two-step learning approach for modeling highly unbalanced catastrophic OOP health expenditure data. The data are retrieved from the nationally representative Household Budget Survey collected in 2012 by the Turkish Statistical Institute. In total, 9987 households returned valid survey responses. The predictive models are based on eight common risk factors of catastrophic OOP health expenditure. The minority class in the training dataset is oversampled by using a synthetic minority oversampling technique (SMOTE) function, and the original and balanced oversampled training datasets are used to establish the classification models. Logistic regression (LR), random forest (RF) (100 trees), support vector machine (SVM), and neural network (NN) are determined as classifiers. The weighted percentage of households faced with catastrophic OOP health expenditure is 0.14. Balanced oversampling increases the area under the receiver operating characteristic (ROC) curve of LR, RF, SVM, and NN by 0.08%, 0.62%, 0.20%, and 0.23%, respectively. The ROC curve shows NN and RF to be the best classifiers for a balanced oversampled dataset. Identifying a classifier to model highly imbalanced catastrophic OOP health expenditure requires the two-stage procedure of (i) considering a balance between classes and (ii) comparing alternative classifiers. NN and RF are good classifiers in a prediction task with imbalanced catastrophic OOP health expenditure data.</p></div>","PeriodicalId":46320,"journal":{"name":"Operations Research for Health Care","volume":"27 ","pages":"Article 100275"},"PeriodicalIF":1.5000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1016/j.orhc.2020.100275","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Operations Research for Health Care","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2211692320300552","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 1
Abstract
As a common problem in classification tasks, class imbalance degrades the performance of a classifier. Catastrophic out-of-pocket (OOP) health expenditure is a specific example of a rare event faced by very few households. The objective of the present study is to demonstrate a two-step learning approach for modeling highly imbalanced catastrophic OOP health expenditure data. The data are retrieved from the nationally representative Household Budget Survey collected in 2012 by the Turkish Statistical Institute. In total, 9987 households returned valid survey responses. The predictive models are based on eight common risk factors of catastrophic OOP health expenditure. The minority class in the training dataset is oversampled using a synthetic minority oversampling technique (SMOTE) function, and the original and balanced oversampled training datasets are used to build the classification models. Logistic regression (LR), random forest (RF) (100 trees), support vector machine (SVM), and neural network (NN) are used as classifiers. The weighted percentage of households facing catastrophic OOP health expenditure is 0.14. Balanced oversampling increases the area under the receiver operating characteristic (ROC) curve of LR, RF, SVM, and NN by 0.08%, 0.62%, 0.20%, and 0.23%, respectively. The ROC curves show NN and RF to be the best classifiers on the balanced oversampled dataset. Identifying a classifier to model highly imbalanced catastrophic OOP health expenditure therefore requires a two-stage procedure: (i) balancing the classes and (ii) comparing alternative classifiers. NN and RF are good classifiers for prediction tasks with imbalanced catastrophic OOP health expenditure data.
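To make the two-step procedure concrete, the sketch below reproduces its shape in Python: oversample the minority class in the training split only, then fit LR, RF (100 trees), SVM, and NN on both the original and the balanced training data and compare test-set ROC AUC. The study itself uses the ubSMOTE function from R's "unbalanced" package; here imbalanced-learn's SMOTE stands in for it, and the data, class counts, and model settings are illustrative placeholders rather than the study's.

```python
# A minimal sketch (not the authors' code) of the two-step procedure:
# (i) oversample the minority class in the training data, (ii) compare
# classifiers trained on the original vs. balanced training sets by ROC AUC.
# imbalanced-learn's SMOTE is used as a stand-in for R's ubSMOTE; all data
# and parameter choices below are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic placeholder: 9987 households, eight risk factors, rare positive
# class (household faces catastrophic OOP health expenditure).
n_households, n_features, n_positive = 9987, 8, 140  # illustrative counts
X = rng.normal(size=(n_households, n_features))
y = np.zeros(n_households, dtype=int)
y[:n_positive] = 1
rng.shuffle(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: balance the classes by oversampling the minority class,
# in the training split only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Step 2: fit the four classifiers on the original and the balanced
# training data and compare ROC AUC on the untouched test set.
classifiers = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "NN": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
}
for name, clf in classifiers.items():
    results = {}
    for label, (X_fit, y_fit) in {
        "original": (X_train, y_train),
        "oversampled": (X_bal, y_bal),
    }.items():
        clf.fit(X_fit, y_fit)
        scores = clf.predict_proba(X_test)[:, 1]
        results[label] = roc_auc_score(y_test, scores)
    print(f"{name}: AUC original={results['original']:.3f}, "
          f"oversampled={results['oversampled']:.3f}")
```

Oversampling is applied only after the train/test split so that synthetic minority examples never leak into the evaluation set; this mirrors the two-stage logic of first balancing the classes and then comparing the candidate classifiers on held-out data.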