Umu Amanah Nur Rohmawati, S. W. Sihwi, D. E. Cahyani
{"title":"SEMAR: An Interface for Indonesian Hate Speech Detection Using Machine Learning","authors":"Umu Amanah Nur Rohmawati, S. W. Sihwi, D. E. Cahyani","doi":"10.1109/ISRITI.2018.8864484","DOIUrl":null,"url":null,"abstract":"Hate Speech has become government and public's concern because of the high number of hate speech cases on social media that occur in Indonesia, which are getting increased in recent years. Because of that, Indonesian hate speech detection becomes crucial. This research proposes SEMAR, an engine to detect Indonesian hate speech built using machine learning technique. This study tested and compared popular supervised algorithms including Naive Bayes Classifier (NBC), Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression (LR) to determine which of the method is most suitable for solving Indonesian hate speech issue. It also compared two vectorizers, which are Hashing vectorizer and Term Frequency Inverse Document Frequency (TF-IDF). SEMAR interfaces were successfully developed, they are Application Programming Interface (API) and anti-hate comment WordPress plugin. SEMAR API was implemented using SVM with TF-IDF model, due to the highest accuracy with average score is (0.870726276). API allows web developer to use machine learning model by accessing endpoint URL from where the API is served and do not need training the model every time they use it, while WordPress is chosen because it is the most widely used Content Management System (CMS) for creating websites in the world (31,7%). Not only detecting hate comment automatically, but the system also designed to make training data continues to grow. It allows user to give feedback on prediction given by engine, feedback stored into database as new training data. The System will perform self-training daily using both old and new training data so the model's performance will improve time by time.","PeriodicalId":162781,"journal":{"name":"2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISRITI.2018.8864484","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
Abstract
Hate Speech has become government and public's concern because of the high number of hate speech cases on social media that occur in Indonesia, which are getting increased in recent years. Because of that, Indonesian hate speech detection becomes crucial. This research proposes SEMAR, an engine to detect Indonesian hate speech built using machine learning technique. This study tested and compared popular supervised algorithms including Naive Bayes Classifier (NBC), Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression (LR) to determine which of the method is most suitable for solving Indonesian hate speech issue. It also compared two vectorizers, which are Hashing vectorizer and Term Frequency Inverse Document Frequency (TF-IDF). SEMAR interfaces were successfully developed, they are Application Programming Interface (API) and anti-hate comment WordPress plugin. SEMAR API was implemented using SVM with TF-IDF model, due to the highest accuracy with average score is (0.870726276). API allows web developer to use machine learning model by accessing endpoint URL from where the API is served and do not need training the model every time they use it, while WordPress is chosen because it is the most widely used Content Management System (CMS) for creating websites in the world (31,7%). Not only detecting hate comment automatically, but the system also designed to make training data continues to grow. It allows user to give feedback on prediction given by engine, feedback stored into database as new training data. The System will perform self-training daily using both old and new training data so the model's performance will improve time by time.