Muhammad Alkaff, Muhammad Afrizal Miqdad, Muhammad Fachrurrazi, Muhammad Nur Abdi, Ahmad Zainul Abidin, Raisa Amalia
{"title":"使用机器学习方法检测Instagram上的孟加拉语仇恨言论","authors":"Muhammad Alkaff, Muhammad Afrizal Miqdad, Muhammad Fachrurrazi, Muhammad Nur Abdi, Ahmad Zainul Abidin, Raisa Amalia","doi":"10.30812/matrik.v22i3.2939","DOIUrl":null,"url":null,"abstract":"Hate speech refers to verbal expression or communication that aims to provoke or discriminate against individuals. The Ministry of Communication and Information of Indonesia has encountered and dealt with 3,640 cases of hate speech transmitted through digital channels between 2018 and 2021. Particularly in South Kalimantan, hate speech in the local language, Banjarese has become increasingly prevalent in recent years. Surprisingly, there is a lack of research on using machine learning to detect hate speech in the Banjarese language, specifically on Instagram. Therefore, this study aimed to address this gap by constructing a dataset of Banjarese language hate speech and comparing various feature extraction and machine learning models to detect Banjarese language hate speech effectively. Thisresearch used several feature extraction techniques and machine learning methods to detect Banjareselanguage hate speech. The feature extraction methods used were Word N-Gram, Term Frequency- Inverse Document Frequency (TF-IDF), a combination of Word N-Gram and TF-IDF, Word2Vec, and Glove, while the machine learning methods used were Support Vector Machine (SVM), Na¨ıve Bayes, and Decision Tree. The results of this study revealed that the combination of TF-IDF for feature extraction and SVM as the model achieves exceptional performance. The average Recall, Precision, Accuracy, and F1-Score score exceeded 90%, demonstrating the model’s ability to identify Banjarese hate speech accurately.","PeriodicalId":364657,"journal":{"name":"MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hate Speech Detection for Banjarese Languages on Instagram Using Machine Learning Methods\",\"authors\":\"Muhammad Alkaff, Muhammad Afrizal Miqdad, Muhammad Fachrurrazi, Muhammad Nur Abdi, Ahmad Zainul Abidin, Raisa Amalia\",\"doi\":\"10.30812/matrik.v22i3.2939\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hate speech refers to verbal expression or communication that aims to provoke or discriminate against individuals. The Ministry of Communication and Information of Indonesia has encountered and dealt with 3,640 cases of hate speech transmitted through digital channels between 2018 and 2021. Particularly in South Kalimantan, hate speech in the local language, Banjarese has become increasingly prevalent in recent years. Surprisingly, there is a lack of research on using machine learning to detect hate speech in the Banjarese language, specifically on Instagram. Therefore, this study aimed to address this gap by constructing a dataset of Banjarese language hate speech and comparing various feature extraction and machine learning models to detect Banjarese language hate speech effectively. Thisresearch used several feature extraction techniques and machine learning methods to detect Banjareselanguage hate speech. The feature extraction methods used were Word N-Gram, Term Frequency- Inverse Document Frequency (TF-IDF), a combination of Word N-Gram and TF-IDF, Word2Vec, and Glove, while the machine learning methods used were Support Vector Machine (SVM), Na¨ıve Bayes, and Decision Tree. The results of this study revealed that the combination of TF-IDF for feature extraction and SVM as the model achieves exceptional performance. The average Recall, Precision, Accuracy, and F1-Score score exceeded 90%, demonstrating the model’s ability to identify Banjarese hate speech accurately.\",\"PeriodicalId\":364657,\"journal\":{\"name\":\"MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.30812/matrik.v22i3.2939\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.30812/matrik.v22i3.2939","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
仇恨言论是指以挑衅或歧视个人为目的的言语表达或交流。印度尼西亚通信和信息部在2018年至2021年期间遭遇并处理了3640起通过数字渠道传播的仇恨言论案件。特别是在南加里曼丹,用当地语言班加雷斯语的仇恨言论近年来变得越来越普遍。令人惊讶的是,目前还缺乏使用机器学习来检测孟加拉语仇恨言论的研究,特别是在Instagram上。因此,本研究旨在通过构建巴尼亚语仇恨言论数据集,并比较各种特征提取和机器学习模型来有效检测巴尼亚语仇恨言论,从而解决这一空白。本研究使用了几种特征提取技术和机器学习方法来检测巴贾雷语的仇恨言论。使用的特征提取方法是Word N-Gram、Term Frequency- Inverse Document Frequency (TF-IDF)、Word N-Gram和TF-IDF的组合、Word2Vec和Glove,而使用的机器学习方法是支持向量机(SVM)、纳伊ıve贝叶斯和决策树。本研究结果表明,结合TF-IDF进行特征提取和SVM作为模型取得了优异的性能。Recall, Precision, Accuracy和F1-Score的平均得分超过90%,表明该模型能够准确识别Banjarese仇恨言论。
Hate Speech Detection for Banjarese Languages on Instagram Using Machine Learning Methods
Hate speech refers to verbal expression or communication that aims to provoke or discriminate against individuals. The Ministry of Communication and Information of Indonesia has encountered and dealt with 3,640 cases of hate speech transmitted through digital channels between 2018 and 2021. Particularly in South Kalimantan, hate speech in the local language, Banjarese has become increasingly prevalent in recent years. Surprisingly, there is a lack of research on using machine learning to detect hate speech in the Banjarese language, specifically on Instagram. Therefore, this study aimed to address this gap by constructing a dataset of Banjarese language hate speech and comparing various feature extraction and machine learning models to detect Banjarese language hate speech effectively. Thisresearch used several feature extraction techniques and machine learning methods to detect Banjareselanguage hate speech. The feature extraction methods used were Word N-Gram, Term Frequency- Inverse Document Frequency (TF-IDF), a combination of Word N-Gram and TF-IDF, Word2Vec, and Glove, while the machine learning methods used were Support Vector Machine (SVM), Na¨ıve Bayes, and Decision Tree. The results of this study revealed that the combination of TF-IDF for feature extraction and SVM as the model achieves exceptional performance. The average Recall, Precision, Accuracy, and F1-Score score exceeded 90%, demonstrating the model’s ability to identify Banjarese hate speech accurately.