A Malay Language Cyberbullying Detection Model on Twitter using Supervised Machine Learning

2022 International Visualization, Informatics and Technology Conference (IVIT). Pub Date: 2022-11-01. DOI: 10.1109/IVIT55443.2022.10033395

Nurina Farhanah Binti Johari, J. Jaafar


Citation count: 0

Abstract

This research detects cyberbullying in the Malay language using supervised machine learning (ML) and Natural Language Processing (NLP). Given the high number of cyberbullying cases in Malaysia over the years, and the belief that a growing number of cases go unreported, an intelligent way to detect cyberbullying on social media is needed. This research therefore explores how supervised ML and NLP can help detect Malay-language cyberbullying incidents on social media. The dataset was collected from Twitter by scraping tweets that contain common Malay words used in cyberbullying incidents; the tweets were then labelled into six classes: appearance, intellectual, political, racial, sexual, and non-abusive. The resulting dataset is imbalanced and contains 45,580 tweets. Models were built using Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF) algorithms, each combined with three feature extraction techniques: Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec. The results indicate that the best model combines LR with TF-IDF features. This model was improved further by applying the Synthetic Minority Oversampling Technique (SMOTE) to address the class imbalance and by tuning the model hyperparameters. The F-score of the optimised TF-IDF and LR model is 0.46.
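To make the two quantities named in the abstract concrete, the following is a minimal sketch of TF-IDF weighting and of an F-score computed over several classes. It is not the authors' implementation. The abstract does not say which TF-IDF variant or which F-score averaging was used, so the sketch assumes relative term frequency multiplied by ln(N/df) and an unweighted (macro) average over classes. The function names `tfidf` and `macro_f1` and the placeholder tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: (count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder tokens stand in for tokenised tweets.
vectors = tfidf([["a", "b", "a"], ["b", "c"]])

# A classifier that always predicts the majority class on imbalanced labels.
y_true = ["non-abusive"] * 4 + ["racial"]
y_pred = ["non-abusive"] * 5
score = macro_f1(y_true, y_pred)
```

In this toy case the always-majority classifier is right on four of five tweets, yet its macro F-score is only 4/9, because the minority class scores zero. This is one reason an F-score is a more informative summary than accuracy on an imbalanced dataset such as the one described above.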
