Comparison of Error Rate Prediction in CART for Imbalanced Data

UNP Journal of Statistics and Data Science. Pub Date: 2023-11-30. DOI: 10.24036/ujsds/vol1-iss5/117

Lifia Zullani, Dodi Vionanda, Syafriandi, Dina Fitria



Abstract



CART (classification and regression trees) is a tree-based classification algorithm. A CART model is a tree made up of a root node, internal nodes, and terminal nodes. Its accuracy can be assessed by measuring the model's prediction error, and a common way to estimate the error rate is cross-validation. Three cross-validation algorithms are considered: leave-one-out, hold-out, and k-fold. They differ in how they divide the data into training and testing sets, so each has advantages and disadvantages. Hold-out cannot guarantee that the training set represents the entire dataset; leave-one-out is very time-consuming and computationally heavy because the model must be trained as many times as there are data points; and k-fold increases computation time because the training algorithm must be run k times. In practice, data are often imbalanced, meaning that the classes contain different numbers of observations, and in CART this imbalance affects the prediction results. This research compares error rate prediction methods for the CART model on imbalanced data. The study uses three types of data (univariate, bivariate, and multivariate), obtained from differences in population means and correlations between independent variables. The results indicate that k-fold is the most suitable error rate prediction algorithm for CART with imbalanced data.
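As a rough illustration of the three estimators named in the abstract, the sketch below computes hold-out, k-fold, and leave-one-out error rates on a simulated univariate, imbalanced sample. It is not the authors' code. The simulation settings (90 versus 10 observations, unit-variance normal classes whose means differ by `delta`), the depth-one Gini stump standing in for a full CART, the 30% hold-out fraction, k = 10, and every function name are assumptions chosen to keep the example short and free of third-party libraries.

```python
import random


def simulate(n_major, n_minor, delta, rng):
    """Univariate two-class sample: class 0 ~ N(0, 1), class 1 ~ N(delta, 1)."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_major)]
    data += [(rng.gauss(delta, 1.0), 1) for _ in range(n_minor)]
    rng.shuffle(data)
    return data


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels):
    """Most frequent 0/1 label; ties go to class 0."""
    return int(2 * sum(labels) > len(labels))


def fit_stump(train):
    """Depth-one tree (assumed stand-in for CART): lowest weighted Gini split."""
    train = sorted(train)
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    # Start from the no-split tree, which predicts the majority class everywhere.
    best = (gini(ys), None, majority(ys), majority(ys))
    for i in range(1, len(train)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical x values
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(train)
        if score < best[0]:
            best = (score, (xs[i - 1] + xs[i]) / 2, majority(left), majority(right))
    return best[1:]  # (threshold, left_label, right_label)


def predict(stump, x):
    threshold, left_label, right_label = stump
    if threshold is None or x <= threshold:
        return left_label
    return right_label


def error_rate(train, test):
    """Fit on train, return the misclassification rate on test."""
    stump = fit_stump(train)
    return sum(predict(stump, x) != y for x, y in test) / len(test)


def hold_out(data, test_fraction, rng):
    """Single random split: one model, one test set."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return error_rate(shuffled[n_test:], shuffled[:n_test])


def k_fold(data, k, rng):
    """Train k times, each time testing on the fold that was left out."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    errors = 0
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        stump = fit_stump(train)
        errors += sum(predict(stump, x) != y for x, y in test)
    return errors / len(data)  # pooled over all folds


def leave_one_out(data):
    """Leave-one-out is k-fold with k equal to the number of observations."""
    return k_fold(data, len(data), random.Random(0))


if __name__ == "__main__":
    rng = random.Random(0)
    data = simulate(n_major=90, n_minor=10, delta=2.0, rng=rng)
    print(f"hold-out:      {hold_out(data, 0.3, rng):.3f}")
    print(f"k-fold (k=10): {k_fold(data, 10, rng):.3f}")
    print(f"leave-one-out: {leave_one_out(data):.3f}")
```

The sketch only shows the mechanics: hold-out fits one model, k-fold fits k, and leave-one-out fits as many models as there are observations, which is the cost trade-off the abstract describes. Judging which estimator is closest to the true error would require repeating the simulation many times and comparing against a large independent test set, which is not attempted here.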

