Imbalanced Data Classification Using Improved Clustering Algorithm and Under-Sampling Method

2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
Pub Date: 2019-12-01
DOI: 10.1109/PDCAT46702.2019.00071

Lu Cao, Hong Shen


Citations: 9

Abstract




Imbalanced classification is an active research topic in data mining and machine learning. Traditional classification algorithms assume some form of symmetry in the class distribution and aim to improve overall classification performance, so they rarely achieve ideal results on imbalanced datasets. To improve classification performance on imbalanced datasets, this paper proposes a cluster-based under-sampling algorithm (CUS) that exploits a key property of support vector machine (SVM) classification: the result depends on the support vectors. First, the majority class is divided into clusters using an improved clustering by fast search and find of density peaks (CFSFDP) algorithm. The improved algorithm selects cluster centers automatically, which overcomes the limitation of the original algorithm. Next, the minority class is combined with each cluster of the majority class in turn to form a training set, and an SVM trained on that set yields the support vectors of the cluster. Retaining the support vectors of each cluster and deleting the non-support vectors produces a new set of majority-class samples and hence a relatively balanced dataset. Finally, the new dataset is classified with an SVM, and performance is evaluated by cross-validation. The experimental results show that the CUS algorithm is effective.
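The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the function names `density_peak_clusters` and `cus_undersample`, the parameters `dc_percentile` and `n_sigma`, the Gaussian-kernel density, and the RBF kernel are all assumptions. In particular, the abstract does not say how the improved CFSFDP picks centers automatically, so the threshold on `rho * delta` used here is only a stand-in. The sketch relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def density_peak_clusters(X, dc_percentile=2.0, n_sigma=2.0):
    """Basic density-peaks (CFSFDP-style) clustering.

    ASSUMPTION: centers are the points whose gamma = rho * delta exceeds
    mean + n_sigma * std. The paper's own selection rule is not given.
    """
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Cutoff distance: a low percentile of all pairwise distances.
    dc = np.percentile(dist[np.triu_indices(n, k=1)], dc_percentile)
    # Local density with a Gaussian kernel (self-contribution removed).
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)  # densest point first
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i], nearest_denser[i] = dist[i, j], j
    gamma = rho * delta
    centers = np.flatnonzero(gamma > gamma.mean() + n_sigma * gamma.std())
    centers = np.union1d(centers, [order[0]])  # densest point is always a center
    labels = np.full(n, -1)
    labels[centers] = np.arange(centers.size)
    # Each remaining point inherits the label of its nearest denser neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels


def cus_undersample(X_maj, X_min, C=1.0):
    """Keep only majority samples that are support vectors of some
    (cluster vs. minority) SVM. The RBF kernel is an assumption."""
    labels = density_peak_clusters(X_maj)
    keep = np.zeros(len(X_maj), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        X_train = np.vstack([X_maj[idx], X_min])
        y_train = np.r_[np.zeros(len(idx), dtype=int),
                        np.ones(len(X_min), dtype=int)]
        svm = SVC(kernel="rbf", C=C).fit(X_train, y_train)
        # support_ indexes X_train; the first len(idx) rows are majority samples.
        sv = svm.support_[svm.support_ < len(idx)]
        keep[idx[sv]] = True
    return X_maj[keep]


# Synthetic demo data (made up for illustration, not from the paper).
rng = np.random.default_rng(0)
X_maj = np.vstack([rng.normal([0, 0], 1.0, (150, 2)),
                   rng.normal([6, 0], 1.0, (150, 2))])
X_min = rng.normal([3, 3], 1.0, (30, 2))

X_kept = cus_undersample(X_maj, X_min)
X_bal = np.vstack([X_kept, X_min])
y_bal = np.r_[np.zeros(len(X_kept), dtype=int), np.ones(len(X_min), dtype=int)]

# Final step: SVM on the rebalanced data, scored by cross-validation.
scores = cross_val_score(SVC(kernel="rbf"), X_bal, y_bal, cv=5, scoring="f1")
print(len(X_maj), "->", len(X_kept), "majority samples; mean F1:", scores.mean())
```

Filtering `support_` by position works because the majority cluster is stacked ahead of the minority samples in each training set, so any support-vector index below the cluster size refers to a majority sample.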


Source journal

2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)

Self-citation rate

0.00%

Articles published