Sinhala Hate Speech Detection in Social Media Using Machine Learning and Deep Learning
W.S.S. Fernando, R. Weerasinghe, E.R.A.D. Bandara
2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), published 2022-11-30
DOI: 10.1109/ICTer58063.2022.10024082 (https://doi.org/10.1109/ICTer58063.2022.10024082)
Citations: 1
Abstract
The rapid rise of information technology and computer science has made it easier than in previous decades to communicate and express beliefs. Because social media is accessible worldwide via the internet, anyone can easily target a person or group that adheres to a different culture or belief. Everyone has the freedom to express their own opinions, but that expression should not be destructive, and everyone has the right to be free of hate speech. Because social media lacks automatic mechanisms for detecting hate speech, anyone can be readily targeted. Because social media service providers lack extensive linguistic expertise in some languages, such as Sinhala, it may take them a few days to delete hateful comments after becoming aware of them. Detecting hate speech in Sinhala is therefore an urgent and crucial task. This study employed machine learning and deep learning algorithms to automatically recognize Sinhala hate speech posted on social media. Features were extracted from the comments using bag-of-words, TF-IDF, Word2Vec, and FastText. Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine, XGBoost, and Random Forest machine learning models, along with CNN, RNN, and LSTM deep learning models, were trained on two pre-collected datasets of different sizes. The six best models were then selected and their test-set performance reported. Of these, FastText with an RNN achieved the highest ROC AUC, 0.71, with 70% accuracy on the test set.
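The abstract reports two test-set metrics, accuracy and ROC AUC, which measure different things: accuracy depends on a decision threshold, while ROC AUC depends only on how the model ranks hate comments against non-hate comments. The following is a minimal sketch of both, using only the Python standard library. The function names, the 0.5 threshold, and the toy labels and scores are assumptions made for illustration; they are not taken from the paper, and the paper's own evaluation code is not shown here.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of comments whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which equals the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data invented for illustration only (1 = hate, 0 = not hate).
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.5]

toy_accuracy = accuracy(toy_labels, toy_scores)  # 5 of 8 correct -> 0.625
toy_auc = roc_auc(toy_labels, toy_scores)        # 14 of 16 pairs -> 0.875
```

On this toy data the model ranks most hate comments above most non-hate comments (ROC AUC 0.875) yet misclassifies three of eight at the 0.5 threshold (accuracy 0.625), which is why the two figures are usually reported together.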