A Malay Language Cyberbullying Detection Model on Twitter using Supervised Machine Learning

Nurina Farhanah Binti Johari, J. Jaafar
2022 International Visualization, Informatics and Technology Conference (IVIT), published 2022-11-01. DOI: 10.1109/IVIT55443.2022.10033395 (https://doi.org/10.1109/IVIT55443.2022.10033395)

Abstract

This research detects cyberbullying in the Malay language using supervised machine learning (ML) and Natural Language Processing (NLP). Given the high number of cyberbullying cases in Malaysia over the years, and the belief that the number of unreported cases is rising, an intelligent way to detect cyberbullying on social media is needed. This research therefore explores how supervised ML and NLP can help detect Malay-language cyberbullying incidents on social media. The dataset was collected from Twitter by scraping tweets that contain common Malay words used in cyberbullying incidents; the tweets were then labelled into six classes: appearance, intellectual, political, racial, sexual, and non-abusive. The resulting dataset is imbalanced and contains 45,580 tweets. Models were built using Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF) algorithms, combined with three feature extraction techniques: Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec. The results indicate that the best model combines LR with TF-IDF feature extraction. This model was improved further by applying an oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE), to address the class imbalance and by tuning the model hyperparameters. The F-score of the optimised TF-IDF and LR model is 0.46.
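
As a rough illustration of the TF-IDF weighting named in the abstract, the sketch below computes term weights for a few tokenised tweets using only the Python standard library. The abstract does not say which TF-IDF variant, tokeniser, or normalisation the authors used, so the smoothed formula here is an assumption (one common variant), the function name `tfidf` is hypothetical, and the tokens are placeholders rather than words from the paper's dataset.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: relative term frequency multiplied by a smoothed
    inverse document frequency, ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Relative frequency of the term inside this document, scaled by idf.
        weighted.append(
            {term: (count / len(doc)) * idf[term] for term, count in counts.items()}
        )
    return weighted


# Hypothetical placeholder tokens, not taken from the paper's dataset.
tweets = [
    ["kata_a", "kata_b", "kata_b"],
    ["kata_a", "kata_c"],
    ["kata_a", "kata_d"],
]
weights = tfidf(tweets)
```

In this toy corpus `kata_a` appears in every tweet, so it receives the lowest weight in each one, while terms confined to a single tweet score higher. Such weight vectors are the kind of input a classifier like LR would be trained on.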