{"title":"A General Framework for Finding the Optimal Imbalance Ratio in Sampling Methods","authors":"Jialin Peng, Yabin Shao, Longhai Huang","doi":"10.1109/icet55676.2022.9825442","DOIUrl":null,"url":null,"abstract":"How to obtain better classification results from imbalance data has always been a research hot spot in the neighborhood of machine learning and data mining. At present, there are many techniques such as sampling and cost-sensitive learning algorithms to reduce the negative impact of imbalance on classification performance. Some scholars start with the relationship between imbalance ratio and classification performance, hoping to improve classification performance. In this paper, the classification performance is mainly improved by improving the sampling method. Considering that many invalid samples may be synthesized, this paper defines a metric of distribution difference in the sampling process. Then, by analyzing the relationship between the distribution difference and the classification performance, the optimal imbalance ratio in the sampling process can be found. Based on some classic general sampling methods, experimental results on some real data sets prove the effectiveness of the framework.","PeriodicalId":166358,"journal":{"name":"2022 IEEE 5th International Conference on Electronics Technology (ICET)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 5th International Conference on Electronics Technology (ICET)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icet55676.2022.9825442","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Obtaining better classification results from imbalanced data has long been an active research topic in machine learning and data mining. Many techniques, such as sampling and cost-sensitive learning algorithms, have been proposed to reduce the negative impact of class imbalance on classification performance. Some researchers have studied the relationship between the imbalance ratio and classification performance in order to improve the latter. This paper improves classification performance mainly by improving the sampling method. Because sampling may synthesize many invalid samples, the paper defines a metric of the distribution difference introduced during the sampling process. Analyzing the relationship between this distribution difference and classification performance then makes it possible to find the optimal imbalance ratio for sampling. Experiments on several real data sets, using a number of classic general-purpose sampling methods, demonstrate the effectiveness of the framework.
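The abstract does not state the distribution-difference metric, the sampling methods, or the rule used to select the ratio, so the following is only a rough sketch of the general idea under stated assumptions. It sweeps candidate imbalance ratios (majority count divided by minority count), oversamples the minority class with a simple SMOTE-like interpolation, and records a hypothetical distribution-difference value alongside a validation F1 score for each ratio. The function names, the mean-plus-standard-deviation metric, the nearest-centroid classifier, and the toy data are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def make_toy(rng, n_maj, n_min):
    """Toy 2-D data: majority class 0 around (0, 0), minority class 1 around (2, 2)."""
    X = np.vstack([rng.normal(0.0, 1.0, size=(n_maj, 2)),
                   rng.normal(2.0, 1.0, size=(n_min, 2))])
    y = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])
    return X, y

def oversample_to_ratio(X, y, target_ratio, rng, minority=1):
    """Add interpolated minority samples until majority/minority is about target_ratio.

    Assumption: interpolation between random minority pairs stands in for
    whichever sampling method is actually used.
    """
    X_min = X[y == minority]
    n_maj = int(np.sum(y != minority))
    n_needed = max(0, int(round(n_maj / target_ratio)) - len(X_min))
    i = rng.integers(0, len(X_min), size=n_needed)
    j = rng.integers(0, len(X_min), size=n_needed)
    lam = rng.random((n_needed, 1))
    X_new = X_min[i] + lam * (X_min[j] - X_min[i])
    y_new = np.full(n_needed, minority, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

def distribution_difference(X_before, X_after):
    """Hypothetical metric: distance between feature means plus distance between feature stds."""
    d_mean = np.linalg.norm(X_before.mean(axis=0) - X_after.mean(axis=0))
    d_std = np.linalg.norm(X_before.std(axis=0) - X_after.std(axis=0))
    return float(d_mean + d_std)

def f1_nearest_centroid(X_train, y_train, X_val, y_val, minority=1):
    """Minority-class F1 of a nearest-centroid classifier (a stand-in for any classifier)."""
    c_min = X_train[y_train == minority].mean(axis=0)
    c_maj = X_train[y_train != minority].mean(axis=0)
    pred_min = (np.linalg.norm(X_val - c_min, axis=1)
                < np.linalg.norm(X_val - c_maj, axis=1))
    true_min = y_val == minority
    tp = int(np.sum(pred_min & true_min))
    fp = int(np.sum(pred_min & ~true_min))
    fn = int(np.sum(~pred_min & true_min))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def sweep_ratios(X_train, y_train, X_val, y_val, ratios, seed=0):
    """Return (ratio, distribution difference, validation F1) for each candidate ratio."""
    rng = np.random.default_rng(seed)
    X_min = X_train[y_train == 1]
    results = []
    for r in ratios:
        X_res, y_res = oversample_to_ratio(X_train, y_train, r, rng)
        diff = distribution_difference(X_min, X_res[y_res == 1])
        score = f1_nearest_centroid(X_res, y_res, X_val, y_val)
        results.append((r, diff, score))
    return results

data_rng = np.random.default_rng(42)
X_train, y_train = make_toy(data_rng, n_maj=400, n_min=40)   # imbalance ratio 10
X_val, y_val = make_toy(data_rng, n_maj=200, n_min=20)

results = sweep_ratios(X_train, y_train, X_val, y_val, ratios=[10.0, 5.0, 2.0, 1.0])
best = max(results, key=lambda t: t[2])  # simplest possible rule: highest validation F1
for r, diff, score in results:
    print(f"ratio={r:4.1f}  diff={diff:.4f}  F1={score:.3f}")
print("selected ratio:", best[0])
```

Picking the ratio with the highest validation score is the simplest possible selection rule. The framework described above instead relates the distribution difference to classification performance, so in practice the `diff` column would feed into that choice in some way the abstract does not detail.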