Deep Learning and Thresholding with Class-Imbalanced Big Data

Justin M. Johnson, T. Khoshgoftaar
DOI: 10.1109/ICMLA.2019.00134
Published in: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), December 2019
Citations: 15

Abstract

Class imbalance is a regularly occurring problem in machine learning that has been studied extensively over the last two decades. Various methods for addressing class imbalance have been introduced, including algorithm-level methods, data-level methods, and hybrid methods. While these methods are well studied using traditional machine learning algorithms, there are relatively few studies that explore their application to deep neural networks. Thresholding, in particular, is rarely discussed in the deep learning with class imbalance literature. This paper addresses this gap by conducting a systematic study on the application of thresholding with deep neural networks using a Big Data Medicare fraud data set. We use random oversampling (ROS), random under-sampling (RUS), and a hybrid ROS-RUS to create 15 training distributions with varying levels of class imbalance. With the fraudulent class size ranging from 0.03%–60%, we identify optimal classification thresholds for each distribution on random validation sets and then score the thresholds on a 20% holdout test set. Through repetition and statistical analysis, confidence intervals show that the default threshold is never optimal when training data is imbalanced. Results also show that the optimal threshold outperforms the default threshold in nearly all cases, and linear models indicate a strong linear relationship between the minority class size and the optimal decision threshold. To the best of our knowledge, this is the first study to provide statistical results that describe optimal classification thresholds for deep neural networks over a range of class distributions.
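To illustrate the kind of validation-set threshold search the abstract describes, the following is a minimal sketch. The synthetic scores, the 1% minority ratio, the F1 criterion, and the 0.01-step grid are all illustrative assumptions; they are not the paper's Medicare data, metric, or exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy validation-set scores standing in for a network's sigmoid outputs
# (hypothetical data): positives tend to score higher than negatives.
n_neg, n_pos = 990, 10  # roughly 1% minority class
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
scores = np.concatenate([rng.beta(2, 5, n_neg), rng.beta(5, 2, n_pos)])

def f1_at(threshold):
    """F1 score of the positive (minority) class at a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep candidate thresholds on the validation scores and keep the best one;
# with heavy imbalance the winner typically sits far from the default 0.5.
candidates = np.linspace(0.01, 0.99, 99)
best = max(candidates, key=f1_at)
print(f"optimal threshold ~ {best:.2f}, F1 = {f1_at(best):.3f}")
```

In practice the chosen threshold would then be frozen and evaluated once on the holdout test set, mirroring the validate-then-test protocol the abstract outlines.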