类不平衡对多类高维类预测的影响

Advances in Methodology and Statistics Pub Date : 2012-01-01 DOI:10.51936/grxm1445

R. Blagus, L. Lusa

{"title":"类不平衡对多类高维类预测的影响","authors":"R. Blagus, L. Lusa","doi":"10.51936/grxm1445","DOIUrl":null,"url":null,"abstract":"The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare its performance with multi-class DLDA, using simulated and real data. Our results show that the class-imbalance has a significant effect on the classification results: the classification is biased towards the majority class as in the two-class problems and the problem is magnified when the number of variables is large. The amount of the bias depends also, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Also variable selection plays an important role in the class-imbalance problem and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the least sensible classifiers to class-imbalance and its use is recommended also for multi-class problems. Whenever possible the experiments should be planned using balanced data in order to avoid the class-imbalance problem.","PeriodicalId":242585,"journal":{"name":"Advances in Methodology and Statistics","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Impact of class-imbalance on multi-class high-dimensional class prediction\",\"authors\":\"R. Blagus, L. Lusa\",\"doi\":\"10.51936/grxm1445\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare its performance with multi-class DLDA, using simulated and real data. Our results show that the class-imbalance has a significant effect on the classification results: the classification is biased towards the majority class as in the two-class problems and the problem is magnified when the number of variables is large. The amount of the bias depends also, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Also variable selection plays an important role in the class-imbalance problem and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the least sensible classifiers to class-imbalance and its use is recommended also for multi-class problems. Whenever possible the experiments should be planned using balanced data in order to avoid the class-imbalance problem.\",\"PeriodicalId\":242585,\"journal\":{\"name\":\"Advances in Methodology and Statistics\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in Methodology and Statistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.51936/grxm1445\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Methodology and Statistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.51936/grxm1445","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

多类监督分类的目标是开发一种规则，当类数大于2时，能够准确地预测新样本的类隶属度。本文考虑高维类不平衡数据:变量数量大大超过样本数量，且每个类的样本数量不相等。我们将重点关注Friedman的三类问题的1对1方法，并展示其类概率如何依赖于二元分类子问题的类概率。我们使用对角线性判别分析(DLDA)作为基本分类器进一步探索其性能，并使用模拟和真实数据将其与多类DLDA的性能进行比较。我们的研究结果表明，类不平衡对分类结果有显著影响:与两类问题一样，分类偏向多数类，当变量数量较大时，问题被放大。偏差的大小还共同取决于类别之间差异的大小和样本量:当类别之间的差异较大或样本量增加时，偏差会减弱。变量选择在班级失衡问题中也起着重要作用，最有效的策略取决于班级之间存在的差异类型。DLDA似乎是对类不平衡最不敏感的分类器之一，它也被推荐用于多类问题。为了避免类不平衡问题，应该尽可能使用平衡的数据来计划实验。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Impact of class-imbalance on multi-class high-dimensional class prediction

The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare its performance with multi-class DLDA, using simulated and real data. Our results show that the class-imbalance has a significant effect on the classification results: the classification is biased towards the majority class as in the two-class problems and the problem is magnified when the number of variables is large. The amount of the bias depends also, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Also variable selection plays an important role in the class-imbalance problem and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the least sensible classifiers to class-imbalance and its use is recommended also for multi-class problems. Whenever possible the experiments should be planned using balanced data in order to avoid the class-imbalance problem.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Advances in Methodology and Statistics

自引率

0.00%

发文量