Enhancing Imbalanced Dataset by Utilizing (K-NN Based SMOTE_3D Algorithm)

IEEE International Conference on Robotics and Automation : ICRA : [proceedings]. IEEE International Conference on Robotics and Automation Pub Date : 2020-04-25 DOI:10.17352/ARA.000002

Khaldoon Alshouiliy, Sujan Ray, A. Al-Ghamdi, D. Agrawal

Volume: 4 1. Pages: 001-006. Publication type: Journal Article.

Citation count: 1

Abstract

Big data is currently a huge industry that grows significantly every year. Machine learning and deep learning algorithms use big data to study, analyze, and parse it and then derive useful results. However, most real datasets are collected through different organizations and social media and mainly fall under the category of Big Data applications. One of the biggest drawbacks of such datasets is an imbalanced representation of samples from different categories. In such cases, classifiers and deep learning techniques are not capable of handling the imbalance, and a majority of existing works tend to overlook the issue. Typical data balancing methods in the literature resort to data resampling, whether by undersampling the majority class or by oversampling the minority class. In this work, we focus on the minority samples and ignore the majority ones. Many researchers have addressed this problem, but most existing work suffers from oversampling or from noise generated in the dataset. Additionally, existing approaches are suitable for either big data or small data, but not both. Moreover, some approaches suffer from long processing times because they use complicated, multi-step algorithms to fix the imbalance problem. Therefore, we introduce a new algorithm that deals with all these issues. We have created a short example to explain briefly how SMOTE works and why it needs to be enhanced, using a very well-known imbalanced dataset that we downloaded from the Kaggle website. We collect the results using the Azure machine learning platform. We then compare the results and find that the model performs well with SMOTE and considerably better than without it.
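As a rough illustration of the oversampling idea the abstract refers to, the following is a minimal sketch of standard SMOTE-style interpolation between a minority sample and one of its k nearest minority neighbours. It is not the authors' SMOTE_3D algorithm, whose details the abstract does not give. The function name `smote`, its parameters, and the toy data are assumptions made for this example only.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Sketch of SMOTE: create n_new synthetic minority samples by
    interpolating between a sample and one of its k nearest neighbours.
    (Hypothetical helper; not the paper's SMOTE_3D.)"""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # a point has at most n - 1 other neighbours

    # Pairwise Euclidean distances within the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # which sample to start from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)

    # Each synthetic point lies on the segment between a sample and a neighbour.
    return minority[base] + gap * (minority[nb] - minority[base])

# Toy minority class (made-up values, for illustration only).
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote(X_min, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays inside the bounding box of the minority class. The same property means synthetic points can fall in regions that overlap the majority class, which is one source of the generated noise mentioned above.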

