基于深度学习和变换器语言模型的网络欺凌文本识别

Q2 Engineering

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems Pub Date : 2024-02-22 DOI:10.4108/eetinis.v11i1.4703

Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker

{"title":"基于深度学习和变换器语言模型的网络欺凌文本识别","authors":"Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker","doi":"10.4108/eetinis.v11i1.4703","DOIUrl":null,"url":null,"abstract":"In the contemporary digital age, social media platforms like Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, there is a notable scarcity of resources for low-resource languages like Bengali, Arabic, Tamil, etc., particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system called BullyFilterNeT tailored for social media texts, considering Bengali as a test case. The intelligent BullyFilterNeT system devised overcomes Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models GloVe, FastText, and Word2Vec are developed for feature extraction in Bengali. These embedding models are utilized in the classification models, employing three statistical models (SVM, SGD, Libsvm), and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XML-RoBERTa, DistilBERT, and BanglaBERT, respectively to overcome the limitations of earlier models. Remarkably, BanglaBERT-based BullyFilterNeT achieves the highest accuracy of 88.04% in our test set, underscoring its effectiveness in cyberbullying text identification in the Bengali language.","PeriodicalId":33474,"journal":{"name":"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems","volume":"14 12","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cyberbullying Text Identification based on Deep Learning and Transformer-based Language Models\",\"authors\":\"Khalid Saifullah, Muhammad Ibrahim Khan, Suhaima Jamal, Iqbal H. Sarker\",\"doi\":\"10.4108/eetinis.v11i1.4703\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the contemporary digital age, social media platforms like Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, there is a notable scarcity of resources for low-resource languages like Bengali, Arabic, Tamil, etc., particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system called BullyFilterNeT tailored for social media texts, considering Bengali as a test case. The intelligent BullyFilterNeT system devised overcomes Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models GloVe, FastText, and Word2Vec are developed for feature extraction in Bengali. These embedding models are utilized in the classification models, employing three statistical models (SVM, SGD, Libsvm), and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XML-RoBERTa, DistilBERT, and BanglaBERT, respectively to overcome the limitations of earlier models. Remarkably, BanglaBERT-based BullyFilterNeT achieves the highest accuracy of 88.04% in our test set, underscoring its effectiveness in cyberbullying text identification in the Bengali language.\",\"PeriodicalId\":33474,\"journal\":{\"name\":\"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems\",\"volume\":\"14 12\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4108/eetinis.v11i1.4703\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Engineering\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"EAI Endorsed Transactions on Industrial Networks and Intelligent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4108/eetinis.v11i1.4703","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Engineering","Score":null,"Total":0}

引用次数: 0

摘要

在当代数字时代，Facebook、Twitter 和 YouTube 等社交媒体平台成为个人表达想法和与他人联系的重要渠道。尽管这些平台促进了更多的联系，但也在无意中引发了负面行为，尤其是网络欺凌。虽然针对英语等高资源语言开展了大量研究，但针对孟加拉语、阿拉伯语、泰米尔语等低资源语言的资源却明显匮乏，尤其是在语言建模方面。本研究以孟加拉语为测试案例，开发了一个专为社交媒体文本定制的名为 BullyFilterNeT 的网络欺凌文本识别系统，从而填补了这一空白。所设计的智能 BullyFilterNeT 系统克服了与非上下文嵌入相关的词汇缺失（OOV）难题，并解决了上下文感知特征表征的局限性。为了便于全面理解，开发了三种非上下文嵌入模型 GloVe、FastText 和 Word2Vec，用于孟加拉语的特征提取。这些嵌入模型被用于分类模型中，采用了三种统计模型（SVM、SGD、Libsvm）和四种深度学习模型（CNN、VDCNN、LSTM、GRU）。此外，研究还采用了六种基于转换器的语言模型：mBERT、bELECTRA、IndicBERT、XML-RoBERTa、DistilBERT 和 BanglaBERT，以克服早期模型的局限性。值得注意的是，基于 BanglaBERT 的 BullyFilterNeT 在我们的测试集中达到了 88.04% 的最高准确率，凸显了其在孟加拉语网络欺凌文本识别中的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Cyberbullying Text Identification based on Deep Learning and Transformer-based Language Models

In the contemporary digital age, social media platforms like Facebook, Twitter, and YouTube serve as vital channels for individuals to express ideas and connect with others. Despite fostering increased connectivity, these platforms have inadvertently given rise to negative behaviors, particularly cyberbullying. While extensive research has been conducted on high-resource languages such as English, there is a notable scarcity of resources for low-resource languages like Bengali, Arabic, Tamil, etc., particularly in terms of language modeling. This study addresses this gap by developing a cyberbullying text identification system called BullyFilterNeT tailored for social media texts, considering Bengali as a test case. The intelligent BullyFilterNeT system devised overcomes Out-of-Vocabulary (OOV) challenges associated with non-contextual embeddings and addresses the limitations of context-aware feature representations. To facilitate a comprehensive understanding, three non-contextual embedding models GloVe, FastText, and Word2Vec are developed for feature extraction in Bengali. These embedding models are utilized in the classification models, employing three statistical models (SVM, SGD, Libsvm), and four deep learning models (CNN, VDCNN, LSTM, GRU). Additionally, the study employs six transformer-based language models: mBERT, bELECTRA, IndicBERT, XML-RoBERTa, DistilBERT, and BanglaBERT, respectively to overcome the limitations of earlier models. Remarkably, BanglaBERT-based BullyFilterNeT achieves the highest accuracy of 88.04% in our test set, underscoring its effectiveness in cyberbullying text identification in the Bengali language.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems Computer Science-Information Systems

CiteScore

4.00

自引率

0.00%

发文量

审稿时长

10 weeks