Damayanti Elisabeth, I. Budi, Muhammad Okky Ibrohim
Hate Code Detection in Indonesian Tweets using Machine Learning Approach: A Dataset and Preliminary Study
2020 8th International Conference on Information and Communication Technology (ICoICT), published 2020-06-01
DOI: 10.1109/ICoICT49345.2020.9166251
Citation count: 11
Abstract
Social media has a side effect: freedom of speech can turn into freedom to hate. People can spread hate speech in creative ways to evade hate speech detectors. Implicit intent is conveyed through many codes, whose purpose is to disguise the targets of hate speech. This paper presents an implementation of hate code detection for Indonesian tweets using machine learning and a classification explainer. First, we developed a ground-truth dataset of hate codes. We generated hate codes under two scenarios: from hate speech classification and from hate code classification. We used Logistic Regression (LR), Naive Bayes (NB), and Random Forest Decision Tree (RFDT) as our classifiers, with TF-IDF and word bigrams as features. The codes take the form of words and phrases. The best f-measure, 94.90%, comes from hate code classification using Logistic Regression with abusive code elimination. This score reflects how well the model identifies tweets that contain no hate codes. For tweets annotated as containing hate codes, the f-measure for recognizing all the hate codes is 28.23%, and the recall is 56.91%.
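To make the feature step concrete, the following is a minimal sketch of TF-IDF weighting over word unigrams and bigrams in plain Python. It is an illustration only, not the authors' implementation: the abstract does not specify the tokenizer, the exact TF-IDF variant, or whether bigrams were combined with single words. The function names `word_ngrams` and `tfidf_vectors`, the whitespace tokenization, the textbook weight tf × log(N / df), and the placeholder texts are all assumptions made for this example.

```python
import math
from collections import Counter


def word_ngrams(text, n_values=(1, 2)):
    """Lowercase, split on whitespace, and return word n-grams.

    By default this yields unigrams followed by bigrams (an assumption;
    the abstract only mentions "word bigrams").
    """
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs):
    """Return one {ngram: weight} dict per document.

    Uses the textbook weight tf * log(N / df), where tf is the raw count
    in the document, N the number of documents, and df the number of
    documents containing the n-gram. Libraries often use smoothed variants.
    """
    grams_per_doc = [word_ngrams(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: count each n-gram once per document.
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))

    vectors = []
    for grams in grams_per_doc:
        tf = Counter(grams)
        vectors.append(
            {g: count * math.log(n_docs / df[g]) for g, count in tf.items()}
        )
    return vectors


# Made-up placeholder texts, not taken from the paper's dataset.
docs = ["alpha beta gamma", "alpha beta delta", "alpha epsilon"]
vectors = tfidf_vectors(docs)
```

In this sketch, an n-gram that occurs in every document gets weight zero, while rarer words and phrases get larger weights. Vectors of this kind would then be passed to a classifier such as Logistic Regression, Naive Bayes, or a random forest.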