Comparison of Error Rate Prediction Methods of C4.5 Algorithm for Balanced Data

Ichlas Djuazva, Dodi Vionanda, Nonong Amalita, Zilrahmi
{"title":"Comparasion of Error Rate Prediction Methods of C4.5 Algorithm for Balanced Data","authors":"None Ichlas Djuazva, None Dodi Vionanda, None Nonong Amalita, None Zilrahmi","doi":"10.24036/ujsds/vol1-iss4/74","DOIUrl":null,"url":null,"abstract":"C45 is a highly effective decision tree algorithm widely used for classification purposes. Compared to CHAID, Cart, and ID3, C4.5 generates decision trees that are easier to understand and does so in a faster manner. This is due to C4.5 selecting attributes based on their information content during each stage of the process. After generating the decision tree model, its performance needs to be evaluated. One commonly used method is the prediction error rate, which assesses the model's performance. The prediction error rate consists of two approaches: the train error rate, which employs the same data for both building and testing the model, potentially leading to overfitting, and the test error rate, which divides the data into training and testing sets. The test error rate includes cross validation techniques such as Leave One Out Cross Validation (LOOCV), Hold Out (HO), and k-folds cross validation. Considering these factors, this research focuses on comparing the three cross-validation methods for predicting error rates applied to the C4.5 algorithm. The study utilizes artificially generated data with a normal distribution, including univariate, bivariate, and multivariate datasets with various combinations of mean differences and correlations. Different correlation structures are applied between two relevant variables and between relevant and irrelevant variables in the bivariate and multivariate data, including three correlation levels: no correlation, moderate correlation, and high correlation. This research findings that k-folds cross validation is the most suitable cross validation method to apply to C4.5.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"UNP Journal of Statistics and Data Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24036/ujsds/vol1-iss4/74","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

C4.5 is a highly effective decision tree algorithm widely used for classification. Compared to CHAID, CART, and ID3, C4.5 generates decision trees that are easier to understand and does so more quickly, because it selects attributes based on their information content at each stage of the tree-building process. After the decision tree model is generated, its performance needs to be evaluated. One commonly used measure is the prediction error rate. The prediction error rate can be estimated with two approaches: the train error rate, which uses the same data for both building and testing the model and can therefore lead to overfitting, and the test error rate, which divides the data into training and testing sets. The test error rate includes cross-validation techniques such as Leave-One-Out Cross-Validation (LOOCV), Hold-Out (HO), and k-fold cross-validation. Considering these factors, this research compares the three cross-validation methods for predicting the error rate of the C4.5 algorithm. The study uses artificially generated, normally distributed data, including univariate, bivariate, and multivariate datasets with various combinations of mean differences and correlations. Different correlation structures are applied between the two relevant variables and between relevant and irrelevant variables in the bivariate and multivariate data, covering three correlation levels: no correlation, moderate correlation, and high correlation. The findings indicate that k-fold cross-validation is the most suitable cross-validation method to apply to C4.5.
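
To make the comparison concrete, the sketch below (not part of the paper) estimates hold-out, LOOCV, and 10-fold cross-validation error rates of a decision tree on balanced, normally distributed bivariate data with a moderate correlation. It uses scikit-learn's DecisionTreeClassifier, which implements CART rather than C4.5, purely as a stand-in; the class sizes, mean shift of 1.0, correlation of 0.5, and k = 10 are illustrative assumptions, not values taken from the study.

```python
# Illustrative sketch only: scikit-learn's DecisionTreeClassifier is an
# optimized CART, not C4.5; it stands in here to show how the three
# error-rate estimates (hold-out, LOOCV, k-fold CV) can be compared on
# synthetic, balanced, normally distributed data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import (
    train_test_split, cross_val_score, LeaveOneOut, KFold
)

rng = np.random.default_rng(0)

# Balanced bivariate data: two normal classes with a mean shift of 1.0
# and a moderate correlation (0.5) between the two relevant variables.
n_per_class, mean_shift, corr = 100, 1.0, 0.5   # assumed values
cov = np.array([[1.0, corr], [corr, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_class)
X1 = rng.multivariate_normal([mean_shift, mean_shift], cov, size=n_per_class)
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

# Hold-out (HO): a single 70/30 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
ho_error = 1 - tree.fit(X_tr, y_tr).score(X_te, y_te)

# LOOCV: n splits, each leaving exactly one observation out for testing.
loocv_error = 1 - cross_val_score(tree, X, y, cv=LeaveOneOut()).mean()

# k-fold cross-validation with k = 10.
kfold_error = 1 - cross_val_score(
    tree, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)
).mean()

print(f"Hold-out error : {ho_error:.3f}")
print(f"LOOCV error    : {loocv_error:.3f}")
print(f"10-fold error  : {kfold_error:.3f}")
```

Repeating such a run over many simulated datasets with different mean shifts and correlation levels, as the abstract describes, is what allows the stability of each error-rate estimate to be compared.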