Enhancing Imbalanced Dataset by Utilizing (K-NN Based SMOTE_3D Algorithm)

Khaldoon Alshouiliy, Sujan Ray, A. Al-Ghamdi, D. Agrawal
DOI: 10.17352/ARA.000002
URL: https://doi.org/10.17352/ARA.000002
Publication date: 2020-04-25
Publication type: Journal Article
Journal (as listed): IEEE International Conference on Robotics and Automation : ICRA : [proceedings]. IEEE International Conference on Robotics and Automation
Source platform: Semantic Scholar
Citations: 1

Abstract

Big data is a large industry that grows significantly every year. Machine learning and deep learning algorithms are used to study, analyze, and parse big data and then derive useful results. However, most real datasets are collected through different organizations and social media and mainly fall under the category of big data applications. One of the biggest drawbacks of such datasets is an imbalanced representation of samples from different categories. In such cases, classifiers and deep learning techniques are not capable of handling the problem, and a majority of existing works tend to overlook it. Typical data balancing methods in the literature resort to data resampling, whether by undersampling the majority class or by oversampling the minority class. In this work, we focus on the minority samples and ignore the majority ones. Many researchers have worked on this problem, but most of the existing work suffers from oversampling or from the noise generated in the dataset. Additionally, existing works are suitable for either big data or small data. Moreover, some other works suffer from long processing times because they use complicated algorithms with many steps to fix the imbalance problem. Therefore, we introduce a new algorithm that deals with all these issues. We have created a short example to explain briefly how SMOTE works and why it needs to be enhanced, using a very well-known imbalanced dataset that we downloaded from the Kaggle website. We collect the results by using the Azure Machine Learning platform. Then, we compare the results and see that the model performs well with SMOTE and much better than without it.
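
Since the abstract refers to how SMOTE works, the following is a minimal sketch of the classic SMOTE idea (oversampling the minority class by interpolating between a minority sample and one of its k nearest minority neighbours). It is not the SMOTE_3D algorithm proposed in the paper, whose details are not given here. The function name `smote`, its parameters, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Illustrative classic SMOTE: create n_new synthetic minority samples.

    Hypothetical helper, not the paper's SMOTE_3D method.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # a sample has at most n - 1 other neighbours

    # Pairwise Euclidean distances between all minority samples.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # sample each synthetic point starts from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to move toward
    target = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)

    # Each synthetic point lies on the segment between a sample and its neighbour.
    return minority[base] + gap * (minority[target] - minority[base])

# Toy minority class (made-up values, for illustration only).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote(minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class. When minority and majority regions overlap, this same property can produce the kind of generated noise the abstract mentions.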