The impact of oversampling with “ubSMOTE” on the performance of machine learning classifiers in prediction of catastrophic health expenditures

Journal impact factor: 1.5; JCR quartile: Q3 (Health Care Sciences & Services)
Songul Cinaroglu
Operations Research for Health Care, vol. 27, Article 100275. Published 2020-12-01. DOI: 10.1016/j.orhc.2020.100275
https://www.sciencedirect.com/science/article/pii/S2211692320300552
Citations: 1

Abstract

Class imbalance is a common problem in classification tasks and degrades classifier performance. Catastrophic out-of-pocket (OOP) health expenditure is a specific example of a rare event faced by very few households. The objective of the present study is to demonstrate a two-step learning approach for modeling highly unbalanced catastrophic OOP health expenditure data. The data are retrieved from the nationally representative Household Budget Survey collected in 2012 by the Turkish Statistical Institute. In total, 9987 households returned valid survey responses. The predictive models are based on eight common risk factors of catastrophic OOP health expenditure. The minority class in the training dataset is oversampled by using a synthetic minority oversampling technique (SMOTE) function, and both the original and the balanced oversampled training datasets are used to build the classification models. Logistic regression (LR), random forest (RF) (100 trees), support vector machine (SVM), and neural network (NN) are used as classifiers. The weighted percentage of households facing catastrophic OOP health expenditure is 0.14. Balanced oversampling increases the area under the receiver operating characteristic (ROC) curve of LR, RF, SVM, and NN by 0.08%, 0.62%, 0.20%, and 0.23%, respectively. The ROC curve shows NN and RF to be the best classifiers for a balanced oversampled dataset. Identifying a classifier to model highly imbalanced catastrophic OOP health expenditure requires the two-stage procedure of (i) considering a balance between classes and (ii) comparing alternative classifiers. NN and RF are good classifiers in a prediction task with imbalanced catastrophic OOP health expenditure data.
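The two stages named in the abstract, balancing the training set and then comparing classifiers by area under the ROC curve, can be illustrated with a minimal sketch. It is not the study's code and not the ubSMOTE function from the title. It is a simplified, standard-library-only version of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) together with a rank-based AUC. The function names `smote` and `roc_auc`, the parameter `k=5`, and the toy two-feature data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    gets the higher score; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy training data (hypothetical, not the survey data): 90 majority rows
# and 10 minority rows with two features each.
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(90)]
minority = [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(10)]

# Stage (i): oversample the minority class of the training set until the
# two classes have the same size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority))
balanced_minority = minority + new_points

# Stage (ii): score each candidate classifier on untouched test data and
# compare AUC values; here a hand-made score vector stands in for a model.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Only the training split is oversampled, in line with the abstract. Evaluating on synthetic rows would make AUC values hard to interpret, so the test data should stay at its original class ratio.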

Source journal: Operations Research for Health Care (Health Care Sciences & Services)
CiteScore: 3.90
Self-citation rate: 0.00%
Articles published: 9
Review time: 69 days