Research on Unbalanced Data Processing Algorithm Base Tomeklinks-Smote

International Conference on Artificial Intelligence and Pattern Recognition Pub Date : 1900-01-01 DOI:10.1145/3430199.3430222

Ai-hua Li, Peng Zhang


Citations: 6

Abstract

Traditional machine learning algorithms tend to be biased toward the majority class when classifying unbalanced data, which lowers classification accuracy on minority-class samples. To improve the classification accuracy of minority-class samples, this paper proposes a data-processing method for imbalanced data based on the TLS algorithm. The method first deletes duplicate samples from the original data set. It then applies the Tomek links undersampling algorithm to delete the boundary and noise samples that form Tomek link pairs between the majority class and the minority class, and finally oversamples the minority class with the SMOTE algorithm. After these steps, the majority-class and minority-class samples are approximately balanced. On several commonly used data sets from the UCI database, the method is compared with the original data set, with traditional SMOTE oversampling applied only to the minority class, and with traditional Tomek links undersampling applied only to the majority class, using SVM, logistic regression, a multilayer perceptron (neural network), and random forest as classifiers. The experiments show that the proposed sampling method improves the recognition of minority-class samples.
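The three steps described in the abstract (duplicate removal, Tomek links undersampling, SMOTE oversampling) can be sketched as follows. This is only an illustrative sketch in plain Python, not the authors' implementation. The function names (`dedupe`, `remove_tomek_links`, `smote`, `tls_resample`), the Euclidean distance, the default `k=5`, the two-class setting, and the choice to drop only the majority-class member of each Tomek link are all assumptions that the abstract does not specify.

```python
import math
import random


def dedupe(X, y):
    """Drop exact duplicate (features, label) rows, keeping first occurrences."""
    seen, X_out, y_out = set(), [], []
    for xi, yi in zip(X, y):
        key = (tuple(xi), yi)
        if key not in seen:
            seen.add(key)
            X_out.append(list(xi))
            y_out.append(yi)
    return X_out, y_out


def nearest(i, X):
    """Index of the point closest to X[i] (Euclidean), excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def remove_tomek_links(X, y, majority):
    """Remove the majority-class member of every Tomek link.

    A Tomek link is a pair of opposite-class points that are each other's
    nearest neighbour. Dropping only the majority member is an assumption.
    """
    nn = [nearest(i, X) for i in range(len(X))]
    drop = {i for i in range(len(X))
            if y[i] == majority and y[nn[i]] != majority and nn[nn[i]] == i}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X, y, minority, k=5, seed=0):
    """Add synthetic minority points until both classes have equal size.

    Each synthetic point lies on the segment between a random minority
    point and one of its k nearest minority neighbours (two-class case).
    """
    rng = random.Random(seed)
    mino = [i for i, yi in enumerate(y) if yi == minority]
    n_new = (len(y) - len(mino)) - len(mino)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(mino)
        neighbours = sorted((j for j in mino if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out


def tls_resample(X, y, majority, minority, k=5, seed=0):
    """Deduplicate, undersample via Tomek links, then oversample via SMOTE."""
    X, y = dedupe(X, y)
    X, y = remove_tomek_links(X, y, majority)
    return smote(X, y, minority, k=k, seed=seed)
```

Calling `tls_resample(X, y, majority=0, minority=1)` on lists of feature vectors and labels returns a resampled copy in which, for two classes with at least two minority samples, both classes have the same number of rows. The all-pairs distance search is quadratic in the number of samples, so real data sets would need a spatial index or a dedicated library.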

