{"title":"Clustering Center Optimization under-Sampling Method for Unbalanced Data","authors":"Haitao Li, Mingjie Zhuang","doi":"10.17706/jsw.15.3.74-85","DOIUrl":null,"url":null,"abstract":": When the number of data in one class is significantly larger or less than the data in other class, under learning algorithm for classification, a problem of learning generalization occurs to the specific class and this is called imbalanced data problem. In this paper, a method of under-sampling based on the optimization cluster center selection (BCUSM) is proposed. First of all, the cluster center selection of K-means clustering algorithm is optimized, the initial cluster center is obtained by calculation, instead of random selection. The optimized method is called OICSK-means. And then use it to cluster the negative samples by setting the same number of clusters as positive samples. According to the cosine similarity, select the most similar samples from each cluster with cluster centers as the negative training samples, and a new training set is established with the positive samples. Finally, training with a new training set. This work selected some data from the UCI database of the University of California, Irvine, and used the support vector machine (SVM) classifier for experimental simulation, and compared the classification effects of this method with other four methods such as synthetic oversampling method (SMOTE). The experimental results demonstrate that the BCUSM has certain effectiveness. that of different data set in the experiment, which indicates that BCUSM under-sampling method is more universal than RUS random under-sampling method, and it also reflects that the RUS random under-sampling method easily loses important sample information when the training data has fewer feature attributes, resulting in poor classification. In addition, the SVM's classification effect on the balanced data set is significantly better than the direct SVM classification of the original data set. This shows that SVM is very sensitive to unbalanced data. When no processing is performed on the original training set, the classification accuracy of the SVM for the positive class is greatly reduced, but it also shows that the SVM has better classification performance when the data set is","PeriodicalId":11452,"journal":{"name":"e Informatica Softw. Eng. J.","volume":"51 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"e Informatica Softw. Eng. J.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.17706/jsw.15.3.74-85","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
When the number of samples in one class is significantly larger or smaller than in another class, classification learning algorithms generalize poorly on the specific class; this is known as the imbalanced data problem. In this paper, an under-sampling method based on optimized cluster center selection (BCUSM) is proposed. First, the cluster center selection of the K-means clustering algorithm is optimized: the initial cluster centers are obtained by calculation instead of random selection. The optimized method is called OICSK-means. OICSK-means is then used to cluster the negative (majority) samples, with the number of clusters set equal to the number of positive (minority) samples. From each cluster, the sample most similar to the cluster center under cosine similarity is selected as a negative training sample, and a new training set is formed together with the positive samples. Finally, a classifier is trained on the new training set. This work selected several data sets from the UCI repository of the University of California, Irvine, used a support vector machine (SVM) classifier for experimental simulation, and compared the classification performance of this method with four other methods, including the synthetic minority oversampling technique (SMOTE). The experimental results demonstrate that BCUSM is effective, and its performance is more stable than that of random under-sampling (RUS) across the different data sets in the experiment, which indicates that the BCUSM under-sampling method is more universal than RUS. The results also show that RUS easily loses important sample information when the training data has few feature attributes, resulting in poor classification. In addition, the SVM's classification performance on the balanced data set is significantly better than direct SVM classification of the original data set, which shows that SVM is very sensitive to unbalanced data: when no processing is performed on the original training set, the SVM's classification accuracy on the positive class is greatly reduced. Conversely, this also shows that SVM has better classification performance when the data set is balanced.
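The abstract describes the BCUSM under-sampling pipeline concretely enough to sketch. Below is a minimal Python illustration, assuming minority and majority feature arrays `X_pos` and `X_neg` (synthetic here). Since the abstract does not give the formula by which OICSK-means computes its initial centers, scikit-learn's standard k-means++ seeding is used as a stand-in; this is a sketch of the cluster-then-select-by-cosine-similarity idea, not the authors' exact method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

def bcusm_undersample(X_pos, X_neg, random_state=0):
    """Under-sample the negative (majority) class down to the size of the
    positive (minority) class, following the BCUSM outline: cluster the
    negatives into k = len(X_pos) clusters, then keep, from each cluster,
    the one negative most cosine-similar to its cluster center."""
    k = len(X_pos)  # number of clusters = number of positive samples
    # NOTE: the paper's OICSK-means computes initial centers by calculation
    # instead of random selection; the abstract does not give that formula,
    # so k-means++ seeding is used here as an assumed stand-in.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=random_state).fit(X_neg)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue  # empty clusters contribute no sample
        center = km.cluster_centers_[c].reshape(1, -1)
        sims = cosine_similarity(X_neg[members], center).ravel()
        selected.append(members[np.argmax(sims)])  # most similar to center
    return X_neg[np.array(selected)]

# Hypothetical imbalanced data for illustration only.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(20, 5))    # minority class
X_neg = rng.normal(-1.0, 1.0, size=(200, 5))  # majority class

# Build the balanced training set and train an SVM, as in the paper.
X_neg_sel = bcusm_undersample(X_pos, X_neg)
X_train = np.vstack([X_pos, X_neg_sel])
y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg_sel))])
clf = SVC(kernel="rbf").fit(X_train, y_train)
```

Selecting one representative per cluster, rather than sampling negatives at random as RUS does, is what the abstract credits for retaining the structure of the majority class while equalizing the class sizes.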