Research on Unbalanced Data Processing Algorithm Base Tomeklinks-Smote
Authors: Ai-hua Li, Peng Zhang
Published in: International Conference on Artificial Intelligence and Pattern Recognition
DOI: 10.1145/3430199.3430222 (https://doi.org/10.1145/3430199.3430222)
Citations: 6
Abstract
Traditional machine learning algorithms tend to be biased toward the majority class when classifying imbalanced data, which lowers classification accuracy on minority-class samples. To improve the classification accuracy of minority-class samples, this paper proposes a data processing method for imbalanced data based on the TLS (Tomek links and SMOTE) algorithm. The method first deletes duplicate samples from the original data set, then uses the Tomek links undersampling algorithm to delete the boundary and noise samples that form Tomek link pairs between the majority class and the minority class, and finally oversamples the minority class with the SMOTE algorithm. After this processing, the majority-class and minority-class samples are approximately balanced. On several commonly used data sets from the UCI database, the method is compared with the original data set, with traditional SMOTE oversampling applied only to the minority class, and with traditional Tomek links undersampling applied only to the majority class, using SVM, logistic regression, a multilayer perceptron (neural network), and random forest as classifiers. The experiments show that the proposed sampling method improves the recognition of minority-class samples.
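The three steps named in the abstract (deduplication, Tomek links removal, SMOTE oversampling) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: Euclidean distance, binary labels, removal of both members of every Tomek link, and oversampling until the class counts match. The function names (`dedupe`, `remove_tomek_links`, `smote`, `tls_resample`), the parameter `k`, and the toy data are all hypothetical.

```python
import math
import random


def dedupe(X, y):
    """Drop exact duplicate (features, label) rows, keeping first occurrences."""
    seen, X_out, y_out = set(), [], []
    for xi, yi in zip(X, y):
        key = (tuple(xi), yi)
        if key not in seen:
            seen.add(key)
            X_out.append(list(xi))
            y_out.append(yi)
    return X_out, y_out


def remove_tomek_links(X, y):
    """Remove both members of every Tomek link.

    A Tomek link is a pair of opposite-class points that are each other's
    nearest neighbour. Dropping both members is an assumption made here.
    """
    idx = range(len(X))
    nn = [min((j for j in idx if j != i), key=lambda j: math.dist(X[i], X[j]))
          for i in idx]
    drop = {i for i in idx if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours.

    Needs at least two minority points when n_new > 0.
    """
    rng = rng or random.Random(0)
    idx = range(len(X_min))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        neighbours = sorted((j for j in idx if j != i),
                            key=lambda j: math.dist(X_min[i], X_min[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from X_min[i] to X_min[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X_min[i], X_min[j])])
    return synthetic


def tls_resample(X, y, minority, k=5, seed=0):
    """Dedupe, remove Tomek links, then SMOTE the minority class (binary labels)."""
    X, y = dedupe(X, y)
    X, y = remove_tomek_links(X, y)
    X_min = [x for x, label in zip(X, y) if label == minority]
    n_new = max(len(X) - 2 * len(X_min), 0)  # majority count minus minority count
    new = smote(X_min, n_new, k, random.Random(seed))
    return X + new, y + [minority] * len(new)


# Toy data: one duplicate majority row, and one Tomek link between
# the majority point [3, 3] and the minority point [3.2, 3.2].
X = [[0, 0], [0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2], [3, 3],
     [5, 5], [5, 6], [6, 5], [3.2, 3.2]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
X_res, y_res = tls_resample(X, y, minority=1)
print(y_res.count(0), y_res.count(1))
```

For real experiments, a maintained library such as imbalanced-learn would be the more practical choice; the brute-force nearest-neighbour search above is quadratic in the number of samples and is only meant to make the order of the steps concrete.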