Studying the Classification Accuracy Performance when Representation is Changed on Several Classifier Techniques
Authors: Ehab A. Omer A. Omer, Wisam H. Benamer
Published in: Proceedings of the International Conference on High Performance Compilation, Computing and Communications, 2017-03-22
DOI: 10.1145/3069593.3069597 (https://doi.org/10.1145/3069593.3069597)
Citations: 2
Abstract
Introduction: When building a predictive data mining model, achieving the highest possible accuracy is a major concern for researchers, so studying the impact of data representation on classification accuracy is essential. Recent research moves between classifier techniques in search of suitable, higher classification accuracy on which to build strong models. The goal of this research is to add a further dimension by focusing on the effect that data representation may have on the classification accuracy of predictive data mining techniques.
Methods: Seven data representations were tested on several classifier techniques: the AS_IS representation, three binary representations, and three normalizations. The binary representations were simple binary, flag, and thermometer; the normalizations were min-max, sigmoidal, and standard deviation normalization. The seven representations were applied to eight classifiers: Neural Network, Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Classification Tree, Naive Bayesian, Rule-Based, and Random Forest Decision Tree. Classification accuracy was tested on two datasets, Wisconsin Breast Cancer and German Credit, both of which have a Boolean target class.
Results: Applying the seven representations to each of the two datasets produced fourteen data representations, which were run through the classifiers using the Orange software. The results showed that classification accuracy varied across all classifiers. Excluding Naive Bayesian, whose accuracy differed by more than 60% between its lowest and highest values, the classification accuracy of all other classifiers varied by about 4.2%.
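The abstract names the six non-AS_IS representations but does not define them. The sketch below shows one common textbook reading of each, in plain Python. The function names, the use of the population standard deviation, the logistic form of the sigmoidal normalization, and the 1-based ordinal levels are all assumptions made for illustration, not the authors' definitions or the Orange implementation.

```python
import math


def min_max(values):
    """Min-max normalization: rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no spread; map it to 0.0 (assumed convention).
    return [(v - lo) / span if span else 0.0 for v in values]


def std_dev_norm(values):
    """Standard deviation normalization (z-score), population std assumed."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]


def sigmoidal(values):
    """Sigmoidal normalization: logistic function of the z-score (assumed form)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in std_dev_norm(values)]


def simple_binary(level, n_bits):
    """Simple binary: the integer level written as n_bits binary digits."""
    return [(level >> i) & 1 for i in reversed(range(n_bits))]


def flag(level, n_levels):
    """Flag representation: one bit per level, only the matching bit set."""
    return [1 if i == level else 0 for i in range(1, n_levels + 1)]


def thermometer(level, n_levels):
    """Thermometer representation: all bits up to and including the level set."""
    return [1 if i <= level else 0 for i in range(1, n_levels + 1)]


if __name__ == "__main__":
    column = [1, 5, 10]  # hypothetical ordinal attribute values
    print(min_max(column))
    print(std_dev_norm(column))
    print(sigmoidal(column))
    print(simple_binary(5, 4))
    print(flag(3, 5))
    print(thermometer(3, 5))
```

Under these assumptions, the three binary encodings differ mainly in how much ordering they preserve: the flag encoding treats every level as unrelated, while the thermometer encoding keeps neighbouring levels close in Hamming distance. This is one plausible reason why a representation change could affect distance-based or probabilistic classifiers differently.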