A Review of Cluster Under-Sampling in Unbalanced Datasets as a Method for Improving Software Defect Prediction

Sani Abdulhamid, Manjula V. S, Zayyad Musa Ahmed
Journal of Applied Science, Information and Computing, Vol. 8, No. 1
DOI: 10.59568/jasic-2023-4-2-03
Published: 2024-01-13
Citations: 0

Abstract

Class imbalance is an inherent issue in many real-world machine learning applications, including software defect prediction, fraud detection, network intrusion and penetration detection, risk management, and medical datasets. It arises when a dataset contains few instances of a particular class, usually the very class the procedure is meant to identify, because the event that class represents is rare. A major motivation for class-imbalance learning is the priority placed on correctly classifying the relatively few minority instances, which incur a higher cost when misclassified than majority instances do. Supervised models are typically designed to maximize overall classification accuracy; because minority examples are rare in the training data, such models tend to misclassify them. Balancing the dataset aids training because it keeps the model from becoming biased toward one class: simply having more data for the majority class should not cause the model to favor it automatically. Data sampling is one way to mitigate class imbalance before training classification models, but most existing methods introduce additional problems during sampling and often overlook other data-quality concerns. The goal of this work is therefore to create an effective sampling algorithm that, through a straightforward logical framework, improves the performance of classification algorithms. By providing a thorough review of the class-imbalance literature while developing and applying a novel Cluster Under-Sampling Technique (CUST), this research contributes to both academia and industry. CUST has been shown to greatly improve the performance of popular classification techniques such as the C4.5 decision tree and One Rule (OneR) when learning from imbalanced datasets.
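The abstract does not specify the CUST algorithm itself, but the general idea of cluster-based under-sampling can be sketched as follows: cluster the majority-class instances and keep one representative per cluster, so the majority class shrinks to the minority-class size while preserving its spatial structure. This is an illustrative sketch only, not the authors' exact CUST procedure; the function name and parameter choices are hypothetical.

```python
# Illustrative cluster-based under-sampling sketch (NOT the paper's exact CUST).
# Assumption: a binary dataset where the majority class outnumbers the minority.
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, majority_label, random_state=0):
    """Shrink the majority class to the minority-class size by clustering
    the majority instances and keeping the point nearest each centroid."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    # One cluster per minority instance yields a balanced result.
    k = len(min_idx)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[maj_idx])
    keep = []
    for c, center in enumerate(km.cluster_centers_):
        members = maj_idx[km.labels_ == c]
        if len(members) == 0:
            continue
        # Retain the real instance closest to the cluster centroid.
        dists = np.linalg.norm(X[members] - center, axis=1)
        keep.append(members[np.argmin(dists)])
    sel = np.sort(np.concatenate([np.array(keep, dtype=int), min_idx]))
    return X[sel], y[sel]

# Toy usage: 20 majority vs. 5 minority instances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (5, 2))])
y = np.array([0] * 20 + [1] * 5)
X_bal, y_bal = cluster_undersample(X, y, majority_label=0)
```

Keeping a centroid-nearest instance per cluster, rather than sampling at random, is one common way such techniques try to discard redundant majority examples instead of informative ones; the balanced output can then be fed to any classifier (e.g., C4.5 or OneR, as evaluated in the paper).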