Deep Learning and Data Sampling with Imbalanced Big Data
Justin M. Johnson, T. Khoshgoftaar
2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), July 2019
DOI: 10.1109/IRI.2019.00038 (https://doi.org/10.1109/IRI.2019.00038)
Citations: 31
Abstract
This study evaluates the use of deep learning and data sampling on a class-imbalanced Big Data problem: Medicare fraud detection. Medicare offers affordable health insurance to the elderly and serves more than 15% of the United States population. To increase transparency and help reduce fraud, the Centers for Medicare and Medicaid Services (CMS) have made several data sets publicly available for analysis. Our research group has conducted several studies using CMS data and traditional (non-deep-learning) machine learning algorithms, but challenges associated with severe class imbalance leave room for improvement. These previous studies serve as baselines as we employ deep neural networks with various data-sampling techniques to determine the efficacy of deep learning in addressing class imbalance. Random over-sampling (ROS), random under-sampling (RUS), and combinations of the two (ROS-RUS) are applied to study how varying levels of class imbalance impact model training and performance. Class-wise performance is maximized by identifying optimal decision thresholds, and a strong linear relationship between minority class size and optimal threshold is observed. Results show that ROS significantly outperforms RUS, that combining ROS and RUS maximizes both performance and efficiency with a 4x speedup in training time, and that the default threshold of 0.5 is never optimal when training data is imbalanced. To the best of our knowledge, this is the first study to provide statistical results comparing ROS, RUS, and ROS-RUS deep learning methods across a range of class distributions. Additional contributions include a unique analysis of thresholding as it relates to minority class size and state-of-the-art performance on the given fraud detection task.
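The sampling strategies and threshold search named in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the fully balanced 1:1 target ratio, the function names, and the geometric-mean criterion for picking a threshold are all assumptions made here for clarity.

```python
import random
from collections import Counter

def random_oversample(X, y, minority_label=1, seed=0):
    """ROS: duplicate randomly chosen minority-class examples until classes balance."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    need = counts[majority] - counts[minority_label]
    extra = [rng.choice(minority_idx) for _ in range(need)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

def random_undersample(X, y, minority_label=1, seed=0):
    """RUS: discard randomly chosen majority-class examples until classes balance."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = sorted(minority_idx + kept)
    return [X[i] for i in idx], [y[i] for i in idx]

def best_threshold(scores, y, grid=None):
    """Pick the decision threshold maximizing the geometric mean of TPR and TNR.

    With imbalanced training data, the optimum typically differs from 0.5.
    """
    grid = grid or [i / 100 for i in range(1, 100)]
    pos = sum(y)
    neg = len(y) - pos

    def gmean(t):
        tp = sum(1 for s, label in zip(scores, y) if label == 1 and s >= t)
        tn = sum(1 for s, label in zip(scores, y) if label == 0 and s < t)
        return ((tp / pos) * (tn / neg)) ** 0.5

    return max(grid, key=gmean)
```

A ROS-RUS combination, as studied in the paper, would apply both steps partway: under-sample the majority class to an intermediate size, then over-sample the minority class to meet it, trading off information loss against training cost.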