Hate Code Detection in Indonesian Tweets using Machine Learning Approach: A Dataset and Preliminary Study

2020 8th International Conference on Information and Communication Technology (ICoICT). Pub Date: 2020-06-01. DOI: 10.1109/ICoICT49345.2020.9166251

Damayanti Elisabeth, I. Budi, Muhammad Okky Ibrohim


Citations: 11

Abstract

Social media has had the side effect of turning freedom of speech into freedom to hate. People can spread hate speech in creative ways to evade hate speech detectors. Implicit intent is expressed through many codes, whose purpose is to disguise the targets of the hate speech. This paper presents an implementation of hate code detection for Indonesian tweets using machine learning and a classification explainer. First, we developed a ground-truth dataset of hate codes. We generated hate codes under two scenarios: hate codes from hate speech classification and hate codes from hate code classification. We used Logistic Regression (LR), Naive Bayes (NB), and Random Forest Decision Tree (RFDT) as our classifiers, with TF-IDF and word bigrams as the features. The codes take the form of words and phrases. The best F-measure is 94.90%, obtained from hate code classification using Logistic Regression with abusive code elimination. This score reflects how well the model detects tweets that contain no hate codes. For tweets annotated as containing hate codes, the F-measure for recognizing all the hate codes is 28.23%, and the recall is 56.91%.
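The abstract reports an F-measure and a recall for recognizing hate codes but does not state how they are computed. As a minimal sketch, assuming a set-based comparison between the codes extracted for one tweet and the codes annotated for it, the scores could be calculated as below. The function name `code_scores` and the example tokens are hypothetical placeholders and are not taken from the paper or its dataset.

```python
def code_scores(predicted, gold):
    """Set-based precision, recall, and F-measure for one tweet's hate codes.

    Assumption: a predicted code counts as correct only when it exactly
    matches an annotated code (word or phrase). The paper may use a
    different matching or averaging scheme.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical placeholder tokens, not real hate codes from the dataset.
gold_codes = {"code_a", "code_b phrase"}
predicted_codes = {"code_a", "other_word", "another_word", "third word"}

precision, recall, f_measure = code_scores(predicted_codes, gold_codes)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

In this toy case the extractor finds half of the annotated codes but also returns several spurious ones, so recall is noticeably higher than the F-measure. That is one way a recall well above the F-measure can arise, though whether it explains the reported figures is not something the abstract confirms.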
