Cyberbullying Text Identification based on Deep Learning and Transformer-based Language Models

Q2 Engineering
Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker
Journal: EAI Endorsed Transactions on Industrial Networks and Intelligent Systems
Publication date: 2024-02-22
DOI: 10.4108/eetinis.v11i1.4703 (https://doi.org/10.4108/eetinis.v11i1.4703)

Abstract

In the contemporary digital age, social media platforms such as Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, resources remain notably scarce for low-resource languages such as Bengali, Arabic, and Tamil, particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system, BullyFilterNeT, tailored to social media texts, with Bengali as a test case. BullyFilterNeT overcomes the Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models (GloVe, FastText, and Word2Vec) are developed for feature extraction in Bengali. These embeddings feed the classification models: three statistical models (SVM, SGD, Libsvm) and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models (mBERT, bELECTRA, IndicBERT, XLM-RoBERTa, DistilBERT, and BanglaBERT) to overcome the limitations of the earlier models. The BanglaBERT-based BullyFilterNeT achieves the highest accuracy, 88.04% on our test set, underscoring its effectiveness for cyberbullying text identification in Bengali.
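The OOV problem mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows, under stated assumptions, why a word-level lookup table fails on an unseen spelling while a FastText-style subword model still produces a vector. The romanized toy tokens, the sizes DIM and BUCKETS, the md5-based hashing, and the randomly initialised vectors are all illustrative stand-ins for trained parameters.

```python
import hashlib
import random

DIM = 16        # embedding size (illustrative assumption)
BUCKETS = 2000  # number of hashed n-gram buckets (illustrative assumption)


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def bucket(ngram):
    """Map an n-gram to a stable bucket index (md5 is a stand-in hash)."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


# Random vectors stand in for trained subword parameters.
_rng = random.Random(0)
BUCKET_VECTORS = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
                  for _ in range(BUCKETS)]

# Toy word-level table (hypothetical romanized tokens, made-up values).
WORD_VECTORS = {"bhalo": [0.1] * DIM, "kharap": [-0.1] * DIM}


def word_lookup(word):
    """Word-level lookup: returns None for any word outside the vocabulary."""
    return WORD_VECTORS.get(word)


def subword_vector(word):
    """Average the bucket vectors of the word's character n-grams."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * DIM
    total = [0.0] * DIM
    for gram in grams:
        vec = BUCKET_VECTORS[bucket(gram)]
        for k in range(DIM):
            total[k] += vec[k]
    return [x / len(grams) for x in total]


def shared_ngrams(a, b):
    """N-grams that two words have in common."""
    return set(char_ngrams(a)) & set(char_ngrams(b))


if __name__ == "__main__":
    # "kharaap" is an unseen spelling variant of the in-vocabulary "kharap".
    print(word_lookup("kharaap"))
    print(len(shared_ngrams("kharap", "kharaap")))
    print(len(subword_vector("kharaap")))
```

Because the unseen variant shares most of its character n-grams with the known word, its averaged vector draws on many of the same parameters. That reuse is the mechanism that lets subword models cope with the noisy spellings common in social media text.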
Source journal metrics
CiteScore: 4.00
Self-citation rate: 0.00%
Articles published: 15
Review time: 10 weeks