A novel data balancing technique via resampling majority and minority classes toward effective classification

Q2 Engineering

Telkomnika (Telecommunication Computing Electronics and Control) Pub Date : 2023-12-01 DOI:10.12928/telkomnika.v21i6.25211

Mahmudul Hasan, Md. Fazle Rabbi, Md. Nahid Sultan, A. M. Nitu, Md. Palash Uddin


Citations: 0

Abstract

Classification is a predictive modelling task in machine learning (ML) in which a class label is determined for a given example described by predefined features. In applications such as handwritten character recognition, spam identification, disease detection, and signal identification, classification requires training data with many features and labelled instances. In medical informatics, high precision and recall are mandatory in addition to high classifier accuracy. Most real-life datasets are imbalanced, which hampers the overall performance of classifiers. Existing data balancing techniques operate on the whole dataset at once, which sometimes causes overfitting or underfitting. We propose a data balancing technique that follows a divide-and-conquer procedure: the dataset is clustered into several segments, and both oversampling and undersampling are performed on each cluster. Finally, the clusters are joined to build a balanced dataset. We use sample data from two heart disease datasets: Hungarian and Long Beach. Logistic regression and random forest classifiers serve as the representative ML algorithms. We compare our proposed technique with the existing SMOTE, NearMiss, and SMOTETomek data balancing techniques. Both algorithms perform better on the dataset balanced with the proposed technique. This technique can be the optimal solution for handling imbalanced data.
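The cluster-then-resample idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, the resampling rule, or the per-cluster target size, so the tiny k-means, the random over- and undersampling, and the "mean class size per cluster" target used here are all assumptions, as are the function names and the toy data.

```python
import random
from collections import Counter, defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (assumed clustering step); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def resample(rows, target, rng):
    """Randomly undersample (no replacement) or oversample (add duplicates) to `target` rows."""
    if len(rows) >= target:
        return rng.sample(rows, target)
    return rows + [rng.choice(rows) for _ in range(target - len(rows))]


def balance_by_cluster(X, y, k=2, seed=0):
    """Cluster the data, resample every class inside each cluster, then join the clusters."""
    rng = random.Random(seed)
    assign = kmeans(X, k, seed=seed)
    X_out, y_out = [], []
    for c in range(k):
        by_class = defaultdict(list)
        for xi, yi, a in zip(X, y, assign):
            if a == c:
                by_class[yi].append(xi)
        if not by_class:
            continue  # empty cluster, nothing to resample
        # Assumed target: the mean class size inside this cluster, so the majority
        # class is undersampled and the minority class is oversampled.
        target = round(sum(len(v) for v in by_class.values()) / len(by_class))
        for label, rows in by_class.items():
            for xi in resample(rows, target, rng):
                X_out.append(xi)
                y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): two well-separated groups, each dominated by class 0.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.4, 0.0), (0.0, 0.4),
     (0.5, 0.2), (0.2, 0.5), (0.3, 0.1), (0.1, 0.3),
     (10.0, 10.0), (10.1, 10.2), (10.2, 10.1), (10.3, 10.3), (10.4, 10.0),
     (10.0, 10.4), (10.3, 10.1), (10.1, 10.3)]
y = [0] * 8 + [1] * 2 + [0] * 6 + [1] * 2

Xb, yb = balance_by_cluster(X, y, k=2)
print("before:", Counter(y), "after:", Counter(yb))
```

One limitation of this sketch: a cluster that contains only one class is passed through unchanged, so the joined dataset is balanced only when every cluster holds examples of each class. Random duplication is also a stand-in; a synthetic oversampler such as SMOTE could be swapped in at the `resample` step.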


Source journal

Telkomnika (Telecommunication Computing Electronics and Control) Engineering-Electrical and Electronic Engineering

CiteScore

4.00

Self-citation rate

0.00%

Articles published

158

Journal description: TELKOMNIKA (Telecommunication Computing Electronics and Control) is a peer-reviewed international journal in English, published in four issues per year (March, June, September, and December). The aim of TELKOMNIKA is to publish high-quality articles dedicated to all aspects of the latest outstanding developments in the field of electrical engineering. Its scope encompasses the engineering of signal processing, electrical (power), electronics, instrumentation & control, telecommunication, computing, and informatics, which covers, but is not limited to, the following areas: Signal Processing[...] Electronics[...] Electrical[...] Telecommunication[...] Instrumentation & Control[...] Computing and Informatics[...]