Endang Wahyu Pamungkas, Divi Galih Prasetyo Putri, A. Fatmawati
{"title":"Hate Speech Detection in Bahasa Indonesia: Challenges and Opportunities","authors":"Endang Wahyu Pamungkas, Divi Galih Prasetyo Putri, A. Fatmawati","doi":"10.14569/ijacsa.2023.01406125","DOIUrl":null,"url":null,"abstract":"This study aims to provide an overview of the current research on detecting abusive language in Indonesian social media. The study examines existing datasets, methods, and challenges and opportunities in this field. The research found that most existing datasets for detecting abusive language were collected from social media platforms such as Twitter, Facebook, and Instagram, with Twitter being the most commonly used source. The study also found that hate speech is the most researched type of abusive language. Various models, including traditional machine learning and deep learning approaches, have been implemented for this task, with deep learning models showing more competitive results. However, the use of transformer-based models is less popular in Indonesian hate speech studies. The study also emphasizes the importance of exploring more diverse phenomena, such as islamophobia and political hate speech. Additionally, the study suggests crowdsourcing as a potential solution for the annotation approach for labeling datasets. Furthermore, it encourages researchers to consider code-mixing issues in abusive language datasets in Indonesia, as it could improve the overall model performance for detecting abusive language in Indonesian data. The study also suggests that the lack of effective regulations and the anonymity afforded to users on most social networking sites, as well as the increasing number of Twitter users in Indonesia, have contributed to the rising prevalence of hate speech in Indonesian social media. The study also notes the importance of considering code-mixed language, out-of-vocabulary words, grammatical errors, and limited context when working with social media data. Keywords—Abusive language; hate speech detection; machine learning; social media","PeriodicalId":13824,"journal":{"name":"International Journal of Advanced Computer Science and Applications","volume":"5 1","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Advanced Computer Science and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14569/ijacsa.2023.01406125","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0
Abstract
This study aims to provide an overview of the current research on detecting abusive language in Indonesian social media. The study examines existing datasets, methods, and challenges and opportunities in this field. The research found that most existing datasets for detecting abusive language were collected from social media platforms such as Twitter, Facebook, and Instagram, with Twitter being the most commonly used source. The study also found that hate speech is the most researched type of abusive language. Various models, including traditional machine learning and deep learning approaches, have been implemented for this task, with deep learning models showing more competitive results. However, the use of transformer-based models is less popular in Indonesian hate speech studies. The study also emphasizes the importance of exploring more diverse phenomena, such as islamophobia and political hate speech. Additionally, the study suggests crowdsourcing as a potential solution for the annotation approach for labeling datasets. Furthermore, it encourages researchers to consider code-mixing issues in abusive language datasets in Indonesia, as it could improve the overall model performance for detecting abusive language in Indonesian data. The study also suggests that the lack of effective regulations and the anonymity afforded to users on most social networking sites, as well as the increasing number of Twitter users in Indonesia, have contributed to the rising prevalence of hate speech in Indonesian social media. The study also notes the importance of considering code-mixed language, out-of-vocabulary words, grammatical errors, and limited context when working with social media data. Keywords—Abusive language; hate speech detection; machine learning; social media
期刊介绍:
IJACSA is a scholarly computer science journal representing the best in research. Its mission is to provide an outlet for quality research to be publicised and published to a global audience. The journal aims to publish papers selected through rigorous double-blind peer review to ensure originality, timeliness, relevance, and readability. In sync with the Journal''s vision "to be a respected publication that publishes peer reviewed research articles, as well as review and survey papers contributed by International community of Authors", we have drawn reviewers and editors from Institutions and Universities across the globe. A double blind peer review process is conducted to ensure that we retain high standards. At IJACSA, we stand strong because we know that global challenges make way for new innovations, new ways and new talent. International Journal of Advanced Computer Science and Applications publishes carefully refereed research, review and survey papers which offer a significant contribution to the computer science literature, and which are of interest to a wide audience. Coverage extends to all main-stream branches of computer science and related applications