A dual algorithmic approach to deal with multiclass imbalanced classification problems

IF 4.2 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)

Big Data Research · Pub Date: 2024-08-02 · DOI: 10.1016/j.bdr.2024.100484

S. Sridhar, S. Anusuya

Volume 38, Article 100484. Full text: https://www.sciencedirect.com/science/article/pii/S2214579624000595

Citations: 0

Abstract

Many real-world applications involve multiclass classification problems in which the data are not evenly distributed across classes. Because of this imbalance, supervised learning models tend to assign instances to the class with the largest number of instances, a serious issue that needs to be addressed. In multiclass imbalanced classification, researchers try to reduce the learning model's bias towards the class with a high sample count in one of three ways: balancing the data before the classifier learns from it, modifying the classifier's learning phase to pay more attention to the class with the fewest instances, or combining both. Existing algorithmic approaches struggle to find a clear boundary between the samples of different classes because of the skewed class distribution and class overlap. As a result, the minority class recognition rate is poor. A new algorithmic approach is proposed that uses two decision trees. The first creates an induced dataset by applying a PCA-based grouping approach and assigning weights to the data samples; the second learns and predicts from the induced dataset. The distinctive feature of this approach is that it recognizes data instances without altering their underlying distribution, and it is applicable to all categories of multiclass imbalanced datasets. The proposed algorithm was evaluated on five multiclass imbalanced datasets from UCI, and the results show that the dual decision tree approach pays better attention to both minority and majority class samples.
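The bias described above can be made concrete with a small sketch. This is not the paper's dual decision tree method, whose details the abstract does not give; it only illustrates why overall accuracy hides poor minority class recognition, and shows one common inverse-frequency heuristic for assigning class weights. The label vector, the function names, and the weighting formula are assumptions chosen for illustration.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / actual instances of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def inverse_frequency_weights(y):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical 3-class labels with a 90 / 8 / 2 split.
y_true = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["a"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
weights = inverse_frequency_weights(y_true)

print(accuracy)  # overall accuracy looks high
print(recall)    # but recall for the two minority classes is zero
print(weights)   # rarer classes receive larger weights
```

With these weights, a misclassified sample from the rarest class costs more than one from the majority class, which is one way a learner can be pushed to pay more attention to the classes with few instances.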




Source journal

Big Data Research Computer Science-Computer Science Applications

CiteScore

8.40

Self-citation rate

3.00%

Articles published

Journal description: The journal aims to promote and communicate advances in big data research by providing a fast, high-quality forum for researchers, practitioners and policy makers from the many different communities working on, and with, this topic. The journal will accept papers on foundational aspects of dealing with big data, as well as papers on specific platforms and technologies used to deal with big data. To promote data science and interdisciplinary collaboration between fields, and to showcase the benefits of data-driven research, papers demonstrating applications of big data in domains as diverse as geoscience, social web, finance, e-commerce, health care, environment and climate, physics and astronomy, chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.