Systematic Analysis of Hateful Text Detection Using Machine Learning Classifiers

Tanzina Akter Tani, Tabassum Islam, Sayed Atique Newaz, N. Sultana
Published in: 2021 13th International Conference on Information & Communication Technology and System (ICTS), pp. 330-335, 2021-10-20
DOI: 10.1109/ICTS52701.2021.9608010
Citations: 1

Abstract

In today's internet-based world, social media is one of the most popular platforms on which users vent feelings and emotions of every kind (frustration, anger, happiness, and so on), often without regard for moral or social values. The abusive or offensive texts that result can cause social disturbances, crimes, and other unethical acts, so there is a strong need to detect such texts and posts and remove them from social media. Researchers have proposed various text detection approaches in related work. In this work, we use three classifiers to detect hateful text: Naïve Bayes (NB), Random Forest (RF), and Support Vector Machine (SVM). We compare the three classifiers using Bag of Words (BoW) and TF-IDF feature extraction, each with unigram and bigram features. The Twitter dataset was under-sampled to balance hateful and clean content. Text preprocessing, which is essential for NLP to produce accurate results, was also carried out. In our results, Naïve Bayes achieved the highest accuracy (89%) with TF-IDF features, whereas Random Forest achieved the highest accuracy (88%) with Bag of Words (BoW) features in the unigram case. Overall, unigram features performed much better than bigram features. Finally, we made a number of principal contributions.
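The abstract names the pipeline stages (under-sampling, BoW or TF-IDF features over unigrams or bigrams, then a classifier) but not their implementation. The following is a minimal standard-library sketch of those stages with a Naïve Bayes classifier only; it is not the authors' code. The toy sentences, the function names, the class `TinyMultinomialNB`, the whitespace tokenizer, and the particular smoothed IDF formula are all assumptions made for illustration. Random Forest and SVM are omitted for brevity.

```python
import math
import random
from collections import Counter

# Invented toy data for illustration, NOT the paper's Twitter dataset.
TEXTS = [
    "i hate you idiot",
    "you are a stupid idiot",
    "have a nice day",
    "what a lovely morning",
    "thank you for the help",
    "see you at lunch",
]
LABELS = ["hateful", "hateful", "clean", "clean", "clean", "clean"]


def ngrams(text, n=1):
    """Lowercase, split on whitespace, return word n-grams (n=1 unigram, n=2 bigram)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def undersample(texts, labels, seed=0):
    """Randomly keep only as many examples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    smallest = min(len(items) for items in by_class.values())
    pairs = [(t, c) for c, items in by_class.items() for t in rng.sample(items, smallest)]
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [c for _, c in pairs]


def bow(doc, n=1):
    """Bag of Words: raw n-gram counts for one document."""
    return dict(Counter(ngrams(doc, n)))


def fit_idf(docs, n=1):
    """Smoothed inverse document frequency (assumed variant): ln((1 + N) / (1 + df)) + 1."""
    df = Counter(term for doc in docs for term in set(ngrams(doc, n)))
    return {t: math.log((1 + len(docs)) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(doc, idf, n=1):
    """TF-IDF: count times IDF; n-grams unseen during fit_idf are dropped."""
    return {t: tf * idf[t] for t, tf in bow(doc, n).items() if t in idf}


class TinyMultinomialNB:
    """Hypothetical multinomial Naive Bayes over sparse dict vectors, add-alpha smoothing."""

    def fit(self, vectors, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {t for v in vectors for t in v}
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [v for v, label in zip(vectors, labels) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            totals = {}
            for v in rows:
                for t, w in v.items():
                    totals[t] = totals.get(t, 0.0) + w
            denom = sum(totals.values()) + alpha * len(self.vocab)
            self.log_like[c] = {t: math.log((totals.get(t, 0.0) + alpha) / denom)
                                for t in self.vocab}
        return self

    def predict(self, vectors):
        preds = []
        for v in vectors:
            # Log prior plus weighted log likelihoods; out-of-vocabulary terms are skipped.
            scores = {c: self.log_prior[c]
                      + sum(w * self.log_like[c][t] for t, w in v.items() if t in self.vocab)
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return preds


if __name__ == "__main__":
    texts, labels = undersample(TEXTS, LABELS)
    for n in (1, 2):  # unigram, then bigram
        idf = fit_idf(texts, n)
        features = {
            "BoW": ([bow(t, n) for t in texts], bow("i hate you", n)),
            "TF-IDF": ([tfidf(t, idf, n) for t in texts], tfidf("i hate you", idf, n)),
        }
        for name, (train, query) in features.items():
            model = TinyMultinomialNB().fit(train, labels)
            print(n, name, model.predict([query]))
```

On six invented sentences the output says nothing about the accuracies reported above; the sketch only shows how the feature choice (BoW versus TF-IDF) and the n-gram size slot into the same train-and-predict loop. In practice a library such as scikit-learn would typically replace the hand-written pieces.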