{"title":"Clustering Center Optimization under-Sampling Method for Unbalanced Data","authors":"Haitao Li, Mingjie Zhuang","doi":"10.17706/jsw.15.3.74-85","DOIUrl":null,"url":null,"abstract":": When the number of data in one class is significantly larger or less than the data in other class, under learning algorithm for classification, a problem of learning generalization occurs to the specific class and this is called imbalanced data problem. In this paper, a method of under-sampling based on the optimization cluster center selection (BCUSM) is proposed. First of all, the cluster center selection of K-means clustering algorithm is optimized, the initial cluster center is obtained by calculation, instead of random selection. The optimized method is called OICSK-means. And then use it to cluster the negative samples by setting the same number of clusters as positive samples. According to the cosine similarity, select the most similar samples from each cluster with cluster centers as the negative training samples, and a new training set is established with the positive samples. Finally, training with a new training set. This work selected some data from the UCI database of the University of California, Irvine, and used the support vector machine (SVM) classifier for experimental simulation, and compared the classification effects of this method with other four methods such as synthetic oversampling method (SMOTE). The experimental results demonstrate that the BCUSM has certain effectiveness. that of different data set in the experiment, which indicates that BCUSM under-sampling method is more universal than RUS random under-sampling method, and it also reflects that the RUS random under-sampling method easily loses important sample information when the training data has fewer feature attributes, resulting in poor classification. In addition, the SVM's classification effect on the balanced data set is significantly better than the direct SVM classification of the original data set. This shows that SVM is very sensitive to unbalanced data. When no processing is performed on the original training set, the classification accuracy of the SVM for the positive class is greatly reduced, but it also shows that the SVM has better classification performance when the data set is","PeriodicalId":11452,"journal":{"name":"e Informatica Softw. Eng. J.","volume":"51 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"e Informatica Softw. Eng. J.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.17706/jsw.15.3.74-85","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
When the number of samples in one class is significantly larger or smaller than in another class, classification learning algorithms generalize poorly on the specific class; this is known as the imbalanced data problem. In this paper, an under-sampling method based on optimized cluster center selection (BCUSM) is proposed. First, the cluster center selection of the K-means clustering algorithm is optimized: the initial cluster centers are obtained by calculation instead of random selection. The optimized method is called OICSK-means. OICSK-means is then used to cluster the negative (majority) samples, with the number of clusters set equal to the number of positive (minority) samples. From each cluster, the sample most similar to the cluster center under cosine similarity is selected as a negative training sample, and a new training set is formed together with the positive samples. Finally, a classifier is trained on the new training set. This work selected several data sets from the UCI repository of the University of California, Irvine, used a support vector machine (SVM) classifier for experimental simulation, and compared the classification performance of this method with four other methods, including the synthetic minority oversampling technique (SMOTE). The experimental results demonstrate that BCUSM is effective, and its performance is more stable than that of random under-sampling (RUS) across the different data sets in the experiment, which indicates that the BCUSM under-sampling method is more universal than RUS. The results also show that RUS easily loses important sample information when the training data has few feature attributes, resulting in poor classification. In addition, the SVM's classification performance on the balanced data set is significantly better than direct SVM classification of the original data set, which shows that SVM is very sensitive to unbalanced data: when no processing is performed on the original training set, the SVM's classification accuracy on the positive class is greatly reduced. Conversely, this also shows that SVM has better classification performance when the data set is balanced.
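The abstract describes the BCUSM under-sampling pipeline concretely enough to sketch. Below is a minimal Python illustration, assuming minority and majority feature arrays `X_pos` and `X_neg` (synthetic here). Since the abstract does not give the formula by which OICSK-means computes its initial centers, scikit-learn's standard k-means++ seeding is used as a stand-in; this is a sketch of the cluster-then-select-by-cosine-similarity idea, not the authors' exact method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

def bcusm_undersample(X_pos, X_neg, random_state=0):
    """Under-sample the negative (majority) class down to the size of the
    positive (minority) class, following the BCUSM outline: cluster the
    negatives into k = len(X_pos) clusters, then keep, from each cluster,
    the one negative most cosine-similar to its cluster center."""
    k = len(X_pos)  # number of clusters = number of positive samples
    # NOTE: the paper's OICSK-means computes initial centers by calculation
    # instead of random selection; the abstract does not give that formula,
    # so k-means++ seeding is used here as an assumed stand-in.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=random_state).fit(X_neg)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue  # empty clusters contribute no sample
        center = km.cluster_centers_[c].reshape(1, -1)
        sims = cosine_similarity(X_neg[members], center).ravel()
        selected.append(members[np.argmax(sims)])  # most similar to center
    return X_neg[np.array(selected)]

# Hypothetical imbalanced data for illustration only.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(20, 5))    # minority class
X_neg = rng.normal(-1.0, 1.0, size=(200, 5))  # majority class

# Build the balanced training set and train an SVM, as in the paper.
X_neg_sel = bcusm_undersample(X_pos, X_neg)
X_train = np.vstack([X_pos, X_neg_sel])
y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg_sel))])
clf = SVC(kernel="rbf").fit(X_train, y_train)
```

Selecting one representative per cluster, rather than sampling negatives at random as RUS does, is what the abstract credits for retaining the structure of the majority class while equalizing the class sizes.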