Endang Wahyu Pamungkas, A. Fatmawati, Farah Danisha Salam
Title: Hate Speech Detection on Indonesian Social Media: A Preliminary Study on Code-Mixed Language Issue
DOI: 10.1145/3582768.3582771 (https://doi.org/10.1145/3582768.3582771)
Venue: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval
Publication date: 2022-12-16
Citations: 0
Abstract
Social media has become an important medium for online communication, allowing users to publish content and to express their opinions and feelings on any topic. At the same time, abusive language is a growing problem on social media platforms such as Facebook and Twitter. Indonesia consists of several regions, each with its own local languages; a recent report counts 718 local languages used by different regions and tribes in Indonesia. Indonesians tend to mix their local language with Bahasa Indonesia when communicating on social media platforms such as Twitter. As in other languages, code-mixing is therefore a major challenge for detecting hate speech on Indonesian social media. In this study, we conduct a preliminary experiment on hate speech detection on Indonesian social media, specifically Twitter. Our experiment used 6,115 Indonesian-Javanese code-mixed tweets and 2,945 Indonesian-Sundanese code-mixed tweets. Overall, a traditional machine learning model with lexical features performed best on the Javanese-Indonesian data, while an LSTM network performed best on the Sundanese-Indonesian data. We also found that translating the code-mixed data into more resource-rich languages did not improve classification performance.
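To make the idea of a "traditional machine learning model with lexical features" concrete, here is a minimal sketch under stated assumptions. The abstract does not name the classifier or the exact features, so the multinomial Naive Bayes over unigram counts below is an assumed stand-in, not the authors' method. The names `tokenize` and `NaiveBayes` are hypothetical, and the four training sentences are invented toy examples rather than samples from the paper's dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization; real tweets would need more cleaning.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over unigram counts (assumed stand-in model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples mixing Javanese and Indonesian words (not paper data).
train_texts = [
    "sugeng enjing selamat pagi semua",
    "matur nuwun terima kasih banyak",
    "kowe bodoh banget",
    "dasar kowe goblok",
]
train_labels = ["neutral", "neutral", "abusive", "abusive"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("selamat pagi matur nuwun"))
print(model.predict("kowe goblok banget"))
```

Because the features are plain word counts, the model treats Javanese and Indonesian tokens alike and needs no language identification step. That property may be one reason lexical baselines remain competitive on code-mixed text, though the abstract itself does not make this claim.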