SEMAR: An Interface for Indonesian Hate Speech Detection Using Machine Learning

2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). Pub Date: 2018-11-01. DOI: 10.1109/ISRITI.2018.8864484

Umu Amanah Nur Rohmawati, S. W. Sihwi, D. E. Cahyani


Citations: 7

Abstract

Hate speech has become a concern for both the government and the public in Indonesia because the number of hate speech cases on social media has risen in recent years, which makes Indonesian hate speech detection crucial. This research proposes SEMAR, an engine for detecting Indonesian hate speech built with machine learning techniques. The study tested and compared popular supervised algorithms, namely Naive Bayes Classifier (NBC), Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression (LR), to determine which method is most suitable for Indonesian hate speech detection. It also compared two vectorizers: the hashing vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF). Two SEMAR interfaces were developed: an Application Programming Interface (API) and an anti-hate-comment WordPress plugin. The SEMAR API was implemented using SVM with TF-IDF because this combination achieved the highest accuracy, with an average score of 0.870726276. The API lets web developers use the machine learning model by accessing the endpoint URL where the API is served, so they do not need to train the model each time they use it. WordPress was chosen because it is the most widely used Content Management System (CMS) for creating websites in the world (31.7%). Beyond detecting hate comments automatically, the system is designed so that the training data keeps growing: users can give feedback on the engine's predictions, and that feedback is stored in the database as new training data. The system retrains itself daily using both old and new training data, so the model's performance is expected to improve over time.
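
The feedback loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the table name `training_data`, the functions `open_store`, `add_examples`, `record_feedback`, and `retrain`, and the binary label convention (1 = hate speech, 0 = not) are all assumptions made for the example. The `train` argument is a placeholder for whatever fits the model (in the paper, SVM with TF-IDF), and the daily scheduling itself is left out.

```python
import sqlite3
from typing import Callable, List, Tuple

# Assumed label convention for this sketch: 1 = hate speech, 0 = not hate speech.
Example = Tuple[str, int]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) training-data table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_data ("
        "id INTEGER PRIMARY KEY, "
        "text TEXT NOT NULL, "
        "label INTEGER NOT NULL, "
        "source TEXT NOT NULL)"
    )
    return conn


def add_examples(conn: sqlite3.Connection, examples: List[Example], source: str) -> None:
    """Insert labelled examples, tagging where they came from."""
    conn.executemany(
        "INSERT INTO training_data (text, label, source) VALUES (?, ?, ?)",
        [(text, label, source) for text, label in examples],
    )
    conn.commit()


def record_feedback(
    conn: sqlite3.Connection, text: str, predicted: int, user_says_correct: bool
) -> int:
    """Store user feedback on a prediction as a new training example.

    If the user disagrees with the prediction, the opposite label is stored.
    Returns the label that was saved.
    """
    label = predicted if user_says_correct else 1 - predicted
    add_examples(conn, [(text, label)], source="feedback")
    return label


def retrain(conn: sqlite3.Connection, train: Callable[[List[Example]], object]) -> object:
    """Refit on old and new data together; `train` stands in for the real fitting step."""
    rows = conn.execute(
        "SELECT text, label FROM training_data ORDER BY id"
    ).fetchall()
    return train(rows)
```

In a deployment along these lines, `retrain` would be invoked once a day by an external scheduler, and the model it returns would replace the one served by the API.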
