Overcoming Data Imbalance Problems in Sexual Harassment Classification with SMOTE

Aji Gautama Putrada, Irfan Dwi Wijaya, Dita Oktaria
International Journal on Information and Communication Technology (IJoICT), published 2022-08-02
DOI: 10.21108/ijoict.v8i1.622 (https://doi.org/10.21108/ijoict.v8i1.622)
Citations: 10

Abstract

Delivering justice with the help of artificial intelligence is an active research area. Machine learning with natural language processing (NLP) can classify sexual harassment experiences into two types: quid pro quo (QPQ) and hostile work environment (HWE). However, the classes in specific sexual harassment datasets are often imbalanced. Data imbalance can degrade a classifier's performance because the classifier tends to favor the majority class. This study implements and evaluates the synthetic minority over-sampling technique (SMOTE) to improve QPQ and HWE classification on a sexual harassment experience dataset. The term frequency-inverse document frequency (TF-IDF) method provides the document weighting used in the classification process. We then compare naïve Bayes with K-Nearest Neighbor (KNN) in classifying sexual harassment experiences. The naïve Bayes classifier outperforms the KNN classifier in classifying QPQ and HWE, with AUC values of 0.95 and 0.92, respectively. Applying SMOTE to the naïve Bayes classifier raises the precision of the minority class from 74% to 90%.
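
The abstract does not show how SMOTE was implemented, so the following is only a minimal pure-Python sketch of the general technique, not the authors' code. It creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors. The function name `smote`, its parameters `n_new`, `k`, and `seed`, and the toy two-dimensional data are all assumptions made for illustration; in the setting described above, the feature vectors would be TF-IDF weights rather than hand-written points.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic vectors interpolated between minority samples.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length feature vectors (lists of floats).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy imbalanced data (assumed values): 6 majority vs 3 minority samples.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority data already occupies. Oversampling of this kind is normally applied to the training split only, so that evaluation metrics such as precision and AUC are computed on real samples.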