Comparison of Error Rate Prediction Methods in Classification Modeling with the CHAID Method for Imbalanced Data

Seif Adil El-Muslih, Dodi Vionanda, Nonong Amalita, Admi Salma
{"title":"Comparison of Error Rate Prediction Methods in Classification Modeling with the CHAID Method for Imbalanced Data","authors":"None Seif Adil El-Muslih, None Dodi Vionanda, None Nonong Amalita, None Admi Salma","doi":"10.24036/ujsds/vol1-iss4/81","DOIUrl":null,"url":null,"abstract":"CHAID (Chi-Square Automatic Interaction Detection) is one of the classification algorithms in the decision tree method. The classification results are displayed in the form of a tree diagram model. After the model is formed, it is necessary to calculate the accuracy of the model. The aims is to see the performance of the model. The accuracy of this model can be done by calculating the predicted error rate in the model. There are three methods, such as Leave one out cross-validation (LOOCV), Hold-out, and K-fold cross-validation. These methods have different performances in dividing data into training and testing data, so each method has advantages and disadvantages. Imbalanced data is data that has a different number of class observations. In the CHAID method, imbalanced data affects the prediction results. When the data is increasingly imbalanced the prediction result will approach the number of minority classes. Therefore, a comparison was made for the three error rate prediction methods to determine the appropriate method for the CHAID method in imbalanced data. This research is included in experimental research and uses simulated data from the results of generating data in RStudio. This comparison was made by considering several factors, for the marginal opportunity matrix, different correlations, and several observation ratios. The results of the comparison will be observed using a boxplot by looking at the median error rate and the lowest variance. This research finds that K-fold cross-validation is the most suitable error rate prediction method applied to the CHAID method for imbalanced data.","PeriodicalId":220933,"journal":{"name":"UNP Journal of Statistics and Data Science","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"UNP Journal of Statistics and Data Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24036/ujsds/vol1-iss4/81","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

CHAID (Chi-Square Automatic Interaction Detection) is one of the classification algorithms in the decision tree family, and its classification results are displayed as a tree diagram. After the model is formed, its accuracy must be assessed in order to evaluate its performance, which can be done by estimating the model's prediction error rate. Three such methods are Leave-One-Out Cross-Validation (LOOCV), Hold-out, and K-fold cross-validation. These methods differ in how they divide the data into training and testing sets, so each has advantages and disadvantages. Imbalanced data are data in which the classes have unequal numbers of observations. In the CHAID method, imbalanced data affects the prediction results: as the data become more imbalanced, the prediction results approach the number of minority-class observations. Therefore, the three error rate prediction methods were compared to determine which is appropriate for the CHAID method on imbalanced data. This is an experimental study using simulated data generated in RStudio. The comparison considers several factors: the marginal probability matrix, different correlations, and several observation (class) ratios. The comparison results are examined with boxplots, looking at the median error rate and the lowest variance. The study finds that K-fold cross-validation is the most suitable error rate prediction method for the CHAID method on imbalanced data.
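As an illustration of the three error-rate prediction schemes discussed in the abstract, the following R sketch simulates an imbalanced two-class dataset and compares Hold-out, K-fold cross-validation, and LOOCV estimates of the misclassification rate. It is a minimal sketch, not the paper's actual experiment: `rpart` stands in here for the CHAID algorithm (the CHAID package is distributed on R-Forge rather than CRAN), and the sample size, class ratio, 70/30 split, and K = 10 are illustrative assumptions.

```r
# Sketch: Hold-out vs. K-fold CV vs. LOOCV error-rate estimates
# for a classification tree on simulated imbalanced data.
# NOTE: rpart is a stand-in for CHAID; all constants below are illustrative.

library(rpart)

set.seed(123)

## --- Simulate imbalanced categorical data ---------------------------------
n  <- 500
x1 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x2 <- factor(sample(c("low", "high"), n, replace = TRUE))
# Minority-class probability depends weakly on x1 so the tree has signal
p   <- ifelse(x1 == "a", 0.25, 0.05)
y   <- factor(ifelse(runif(n) < p, "minority", "majority"))
dat <- data.frame(y, x1, x2)

# Misclassification rate of a tree trained on `train` and evaluated on `test`
err_rate <- function(train, test) {
  fit  <- rpart(y ~ ., data = train, method = "class")
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred != test$y)
}

## --- 1. Hold-out: single 70/30 split ---------------------------------------
idx      <- sample(seq_len(n), size = 0.7 * n)
err_hold <- err_rate(dat[idx, ], dat[-idx, ])

## --- 2. K-fold cross-validation (K = 10) -----------------------------------
K     <- 10
folds <- sample(rep(1:K, length.out = n))
err_kfold <- mean(sapply(1:K, function(k) {
  err_rate(dat[folds != k, ], dat[folds == k, ])
}))

## --- 3. Leave-one-out cross-validation -------------------------------------
err_loocv <- mean(sapply(seq_len(n), function(i) {
  err_rate(dat[-i, ], dat[i, , drop = FALSE])
}))

c(holdout = err_hold, kfold = err_kfold, loocv = err_loocv)
```

In the spirit of the study's design, such a simulation would be repeated many times across different marginal probabilities, correlations, and class ratios, and the resulting error-rate estimates for the three methods compared with boxplots of their medians and variances.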