Comparison of Error Prediction Methods in Classification Modeling with CHAID Methods for Balanced Data

UNP Journal of Statistics and Data Science. Pub Date: 2023-11-30. DOI: 10.24036/ujsds/vol1-iss5/116

Findri Wara Putri, Dodi Vionanda, Atus Amadi putra, Fadhilah Fitri


Abstract

Chi-Squared Automatic Interaction Detection (CHAID) is an exploratory method that classifies data by building a classification tree, and the classification results are displayed as a tree diagram. Once the model is built, its performance must be assessed, which can be done by estimating its prediction error rate. Error rate prediction methods work by dividing the data into training data and testing data. Three such methods are considered here: leave-one-out cross-validation (LOOCV), hold-out, and k-fold cross-validation. The methods differ in how they divide the data into training and test sets, so each has its own advantages and disadvantages. The three methods were therefore compared to determine which is most appropriate for CHAID. This is an experimental study that uses simulated data generated in RStudio. The comparison takes several factors into account, namely the marginal probability matrix and different correlations. The results are compared using boxplots, by looking at the median error rate and the lowest variance. The study found that k-fold cross-validation is the most suitable error rate prediction method for CHAID on balanced data.
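The three ways of splitting data into training and test sets can be illustrated with a minimal Python sketch. It is not the authors' code: the study used RStudio, and the classifier here is a simple stand-in (the majority class per category of a single predictor), not CHAID. The function names, the 30% hold-out fraction, k = 5, and the simulated data are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_majority(X, y):
    """Stand-in classifier (NOT CHAID): predicts the majority class
    observed for each value of the first predictor."""
    counts = {}
    for row, label in zip(X, y):
        counts.setdefault(row[0], Counter())[label] += 1
    default = Counter(y).most_common(1)[0][0]  # fallback for unseen values
    table = {v: c.most_common(1)[0][0] for v, c in counts.items()}
    return lambda row: table.get(row[0], default)


def count_errors(train_idx, test_idx, X, y, fit):
    """Fit on the training indices, count misclassified test cases."""
    model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    return sum(model(X[i]) != y[i] for i in test_idx)


def hold_out(X, y, fit, test_fraction=0.3, seed=0):
    """One random split: a fraction is held out once for testing."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(test_fraction * len(idx))))
    test, train = idx[:n_test], idx[n_test:]
    return count_errors(train, test, X, y, fit) / len(test)


def k_fold(X, y, fit, k=5, seed=0):
    """Each case is tested exactly once; errors are pooled over folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    wrong = 0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        wrong += count_errors(train, fold, X, y, fit)
    return wrong / len(idx)


def loocv(X, y, fit):
    """Leave-one-out is k-fold with k equal to the sample size."""
    return k_fold(X, y, fit, k=len(y))


def simulate(n=60, agreement=0.8, seed=1):
    """Hypothetical balanced two-class data with one binary predictor
    that matches the class label with probability `agreement`."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternating labels give balanced classes
        x = label if rng.random() < agreement else 1 - label
        X.append((x,))
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    print("hold-out:", hold_out(X, y, fit_majority))
    print("5-fold  :", k_fold(X, y, fit_majority, k=5))
    print("LOOCV   :", loocv(X, y, fit_majority))
```

Repeating these calls over many simulated data sets and plotting the resulting error rates as boxplots would mirror the kind of comparison the abstract describes. Hold-out relies on a single split, so its estimate would be expected to vary more from run to run than the two cross-validation estimates.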

