Authors: Lu Cao, Hong Shen
Venue: 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
Published: 2019-12-01
DOI: 10.1109/PDCAT46702.2019.00071
Imbalanced Data Classification Using Improved Clustering Algorithm and Under-Sampling Method
Imbalanced classification is an active research topic in data mining and machine learning. Traditional classification algorithms assume a roughly symmetric class distribution and aim to maximize overall classification performance, so they rarely produce satisfactory results on imbalanced datasets. To improve classification performance on such datasets, this paper proposes a cluster-based under-sampling algorithm (CUS) that exploits a key property of support vector machines (SVMs): the classifier is determined by its support vectors. First, the majority class is divided into clusters using an improved version of the clustering by fast search and find of density peaks (CFSFDP) algorithm. The improved algorithm selects cluster centers automatically, which overcomes a limitation of the original algorithm. Next, the minority class is combined with each majority-class cluster in turn to form a training set, and an SVM is trained on it to obtain the support vectors of that cluster. The support vectors of each cluster are retained and the non-support vectors are deleted, which yields a reduced majority class and a relatively balanced dataset. Finally, the new dataset is classified with an SVM and the performance is evaluated by cross-validation. The experimental results show that the CUS algorithm is effective.
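The clustering step can be illustrated with a minimal sketch of the basic density-peaks (CFSFDP) procedure in plain Python. This is not the authors' improved algorithm: the abstract does not say how cluster centers are selected automatically, so the sketch simply takes the points with the largest product of local density and separation distance. The function name `density_peaks` and the parameters `d_c` (cutoff distance) and `n_centers` are assumptions made for illustration only.

```python
import math


def density_peaks(points, d_c, n_centers):
    """Basic density-peaks clustering (illustrative sketch only).

    d_c is the cutoff distance used for the local density.
    n_centers (>= 1) is a hypothetical stand-in for the automatic
    center selection mentioned in the abstract.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: number of other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta_i: distance to the nearest point that precedes i in that order.
    delta = [0.0] * n
    nearest_denser = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            # Densest point: use its largest distance to any point.
            delta[i] = max(dist[i])
            continue
        j = min(order[:pos], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j

    # Centers: the n_centers points with the largest gamma = rho * delta.
    gamma = [rho[i] * delta[i] for i in range(n)]
    centers = sorted(range(n), key=lambda i: (-gamma[i], i))[:n_centers]

    labels = [-1] * n
    for cluster_id, i in enumerate(centers):
        labels[i] = cluster_id

    # Every other point inherits the label of its nearest denser neighbor,
    # which has already been labeled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers


# Two well-separated groups of majority-class points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = density_peaks(points, d_c=1.0, n_centers=2)
print(labels)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(centers)  # [4, 9]
```

In the CUS procedure described above, each resulting cluster would then be paired with the minority class to train an SVM, and only the majority-class support vectors of each cluster would be kept.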