A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data
John T. Hancock, Justin M. Johnson, T. Khoshgoftaar
{"title":"不平衡数据分类阈值优化的比较方法","authors":"John T. Hancock, Justin M. Johnson, T. Khoshgoftaar","doi":"10.1109/CIC56439.2022.00028","DOIUrl":null,"url":null,"abstract":"For the practical application of a classifier, it is necessary to select an optimal output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold. However, selecting a threshold will often involve trading off performance in terms of one metric for performance in terms of another metric. In our literature review of studies involving selecting thresholds to optimize classification of imbalanced data, we find there is an opportunity to expand on previous work for an in-depth study of threshold selection. Our contribution is to present a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on some training data set, we demonstrate how a user-defined set of performance metrics can be utilized to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews’ Correlation Coefficient, f-measure and geometric mean of true positive rate, and true negative rate. Moreover, we compare classification results for thresholds optimized for these metrics with the commonly used default threshold of 0.5, and the prior probability of the positive class (also known as the minority to majority class ratio). Our results show that other thresholds handily outperform the default threshold of 0.5. Moreover, we show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.","PeriodicalId":170721,"journal":{"name":"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data\",\"authors\":\"John T. Hancock, Justin M. Johnson, T. Khoshgoftaar\",\"doi\":\"10.1109/CIC56439.2022.00028\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For the practical application of a classifier, it is necessary to select an optimal output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold. However, selecting a threshold will often involve trading off performance in terms of one metric for performance in terms of another metric. In our literature review of studies involving selecting thresholds to optimize classification of imbalanced data, we find there is an opportunity to expand on previous work for an in-depth study of threshold selection. Our contribution is to present a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on some training data set, we demonstrate how a user-defined set of performance metrics can be utilized to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews’ Correlation Coefficient, f-measure and geometric mean of true positive rate, and true negative rate. 
Moreover, we compare classification results for thresholds optimized for these metrics with the commonly used default threshold of 0.5, and the prior probability of the positive class (also known as the minority to majority class ratio). Our results show that other thresholds handily outperform the default threshold of 0.5. Moreover, we show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.\",\"PeriodicalId\":170721,\"journal\":{\"name\":\"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)\",\"volume\":\"59 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIC56439.2022.00028\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIC56439.2022.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In practical applications of a classifier, one must select an output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold; however, selecting a threshold often involves trading off performance in terms of one metric for performance in terms of another. In our literature review of studies on threshold selection for optimizing the classification of imbalanced data, we find an opportunity to expand on previous work with an in-depth study of threshold selection. Our contribution is a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on a training data set, we demonstrate how a user-defined set of performance metrics can be used to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews correlation coefficient (MCC), F-measure, and the geometric mean of the true positive rate and true negative rate (G-mean). We then compare classification results for thresholds optimized for these metrics with two common baselines: the default threshold of 0.5 and the prior probability of the positive class (also known as the minority-to-majority class ratio). Our results show that the optimized thresholds handily outperform the default threshold of 0.5. We also show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.
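The abstract does not include an implementation, but the threshold-sweep idea it describes is straightforward to illustrate. The sketch below is a minimal example, assuming a scikit-learn workflow: score candidate thresholds on held-out data and keep the one that maximizes a chosen metric. The helper names (`gmean`, `find_best_threshold`) and the synthetic imbalanced data set are illustrative assumptions, not artifacts of the paper.

```python
# Minimal sketch of threshold optimization for a chosen metric.
# Assumptions (not from the paper): helper names, synthetic data,
# RandomForestClassifier as the example learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, matthews_corrcoef, precision_score, recall_score
from sklearn.model_selection import train_test_split


def gmean(y_true, y_pred):
    """Geometric mean of true positive rate and true negative rate."""
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(tpr * tnr)


def find_best_threshold(y_true, y_prob, metric):
    """Sweep candidate thresholds; return the one that maximizes `metric`."""
    best_t, best_score = 0.5, -np.inf
    for t in np.linspace(0.01, 0.99, 99):
        score = metric(y_true, (y_prob >= t).astype(int))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Synthetic imbalanced data: roughly 5% positive class.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_val)[:, 1]

# The four metrics the study optimizes thresholds for.
metrics = [
    ("precision", lambda yt, yp: precision_score(yt, yp, zero_division=0)),
    ("MCC", matthews_corrcoef),
    ("F-measure", f1_score),
    ("G-mean", gmean),
]
for name, metric in metrics:
    t, s = find_best_threshold(y_val, probs, metric)
    print(f"{name}: best threshold = {t:.2f}, score = {s:.3f}")

# The two baselines compared in the paper: the default threshold of 0.5
# and the prior probability of the positive class.
print(f"positive class prior: {y_tr.mean():.3f}")
```

In a full experiment, the threshold would be tuned on a validation set and the comparison against the 0.5 and positive-class-prior baselines reported on a separate test set; the sketch collapses these into one held-out split for brevity.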