融合Transformer-XL与双向循环网络的网络欺凌检测。

IF 3.5 4区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

PeerJ Computer Science Pub Date : 2025-06-17 eCollection Date: 2025-01-01 DOI:10.7717/peerj-cs.2940

Md Mithun Hossain, Md Shakil Hossain, Md Shakhawat Hossain, M Firoz Mridha, Mejdl Safran, Sultan Alfarhood, Dunren Che

{"title":"融合Transformer-XL与双向循环网络的网络欺凌检测。","authors":"Md Mithun Hossain, Md Shakil Hossain, Md Shakhawat Hossain, M Firoz Mridha, Mejdl Safran, Sultan Alfarhood, Dunren Che","doi":"10.7717/peerj-cs.2940","DOIUrl":null,"url":null,"abstract":"Identifying cyberbullying in languages other than English presents distinct difficulties owing to linguistic subtleties and scarcity of annotated datasets. This article presents a new method for identifying cyberbullying in Bengali text data using the Kaggle dataset. This strategy combines Transformer-Extra Large (XL) with bi-directional recurrent neural networks (BiGRU-BiLSTM). Extensive data preparation was performed, including data cleaning, data analysis, and label encoding. Upsampling methods were used to handle imbalanced classes, and data augmentation enhanced the training dataset. We carried out tokenization of the text using a pre-trained tokenizer to capture semantic representations accurately. The model we presented, Transformer-XL-bidirectional gated recurrent units (BiGRU)-bidirectional long short-term memory (BiLSTM), which is called Fusion Transformer-XL, surpassed the performance of the baseline models, attaining an accuracy of 98.17% and an F1-score of 98.18%. Local interpretable model-agnostic explanation (LIME) text explanations were used to understand the reasoning behind the model's choices and performed the cross-dataset evaluation of the model using the English dataset. This helped improve the clarity and reliability of the proposed method. Furthermore, implementing k-fold cross-validation ensures our model's robustness and adaptability across diverse data categories. The results of our study demonstrate the effectiveness of combining Transformer-XL with bi-directional recurrent networks for detecting cyberbullying in Bengali. This emphasizes the significance of using hybrid architectures to address intricate natural language processing problems in languages with limited resources. This study enhances the development of methods for detecting cyberbullying and opens up opportunities for additional investigation into language diversity and social media analytics.","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2940"},"PeriodicalIF":3.5000,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12193458/pdf/","citationCount":"0","resultStr":"{\"title\":\"Fusing Transformer-XL with bi-directional recurrent networks for cyberbullying detection.\",\"authors\":\"Md Mithun Hossain, Md Shakil Hossain, Md Shakhawat Hossain, M Firoz Mridha, Mejdl Safran, Sultan Alfarhood, Dunren Che\",\"doi\":\"10.7717/peerj-cs.2940\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Identifying cyberbullying in languages other than English presents distinct difficulties owing to linguistic subtleties and scarcity of annotated datasets. This article presents a new method for identifying cyberbullying in Bengali text data using the Kaggle dataset. This strategy combines Transformer-Extra Large (XL) with bi-directional recurrent neural networks (BiGRU-BiLSTM). Extensive data preparation was performed, including data cleaning, data analysis, and label encoding. Upsampling methods were used to handle imbalanced classes, and data augmentation enhanced the training dataset. We carried out tokenization of the text using a pre-trained tokenizer to capture semantic representations accurately. The model we presented, Transformer-XL-bidirectional gated recurrent units (BiGRU)-bidirectional long short-term memory (BiLSTM), which is called Fusion Transformer-XL, surpassed the performance of the baseline models, attaining an accuracy of 98.17% and an F1-score of 98.18%. Local interpretable model-agnostic explanation (LIME) text explanations were used to understand the reasoning behind the model's choices and performed the cross-dataset evaluation of the model using the English dataset. This helped improve the clarity and reliability of the proposed method. Furthermore, implementing k-fold cross-validation ensures our model's robustness and adaptability across diverse data categories. The results of our study demonstrate the effectiveness of combining Transformer-XL with bi-directional recurrent networks for detecting cyberbullying in Bengali. This emphasizes the significance of using hybrid architectures to address intricate natural language processing problems in languages with limited resources. This study enhances the development of methods for detecting cyberbullying and opens up opportunities for additional investigation into language diversity and social media analytics.\",\"PeriodicalId\":54224,\"journal\":{\"name\":\"PeerJ Computer Science\",\"volume\":\"11 \",\"pages\":\"e2940\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-06-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12193458/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PeerJ Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.7717/peerj-cs.2940\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2940","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

由于语言的微妙之处和缺乏注释数据集，识别非英语语言的网络欺凌存在明显的困难。本文提出了一种使用Kaggle数据集识别孟加拉文本数据中的网络欺凌的新方法。该策略将特大变压器（XL）与双向循环神经网络（BiGRU-BiLSTM）相结合。进行了大量的数据准备，包括数据清理、数据分析和标签编码。采用上采样方法处理不平衡类，数据增强增强训练数据集。我们使用预训练的标记器对文本进行标记，以准确捕获语义表示。我们提出的模型，Transformer-XL-双向门控循环单元(BiGRU)-双向长短期记忆（BiLSTM），被称为Fusion Transformer-XL，超越了基线模型的性能，达到了98.17%的准确率和98.18%的f1分数。使用局部可解释的模型不可知解释（LIME）文本解释来理解模型选择背后的原因，并使用英语数据集对模型进行跨数据集评估。这有助于提高所提方法的清晰度和可靠性。此外，实现k-fold交叉验证确保了我们的模型在不同数据类别中的鲁棒性和适应性。我们的研究结果证明了将Transformer-XL与双向循环网络结合起来检测孟加拉语网络欺凌的有效性。这强调了在资源有限的语言中使用混合体系结构来解决复杂的自然语言处理问题的重要性。这项研究促进了网络欺凌检测方法的发展，并为语言多样性和社交媒体分析的进一步研究开辟了机会。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Fusing Transformer-XL with bi-directional recurrent networks for cyberbullying detection.

Identifying cyberbullying in languages other than English presents distinct difficulties owing to linguistic subtleties and scarcity of annotated datasets. This article presents a new method for identifying cyberbullying in Bengali text data using the Kaggle dataset. This strategy combines Transformer-Extra Large (XL) with bi-directional recurrent neural networks (BiGRU-BiLSTM). Extensive data preparation was performed, including data cleaning, data analysis, and label encoding. Upsampling methods were used to handle imbalanced classes, and data augmentation enhanced the training dataset. We carried out tokenization of the text using a pre-trained tokenizer to capture semantic representations accurately. The model we presented, Transformer-XL-bidirectional gated recurrent units (BiGRU)-bidirectional long short-term memory (BiLSTM), which is called Fusion Transformer-XL, surpassed the performance of the baseline models, attaining an accuracy of 98.17% and an F1-score of 98.18%. Local interpretable model-agnostic explanation (LIME) text explanations were used to understand the reasoning behind the model's choices and performed the cross-dataset evaluation of the model using the English dataset. This helped improve the clarity and reliability of the proposed method. Furthermore, implementing k-fold cross-validation ensures our model's robustness and adaptability across diverse data categories. The results of our study demonstrate the effectiveness of combining Transformer-XL with bi-directional recurrent networks for detecting cyberbullying in Bengali. This emphasizes the significance of using hybrid architectures to address intricate natural language processing problems in languages with limited resources. This study enhances the development of methods for detecting cyberbullying and opens up opportunities for additional investigation into language diversity and social media analytics.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

PeerJ Computer Science Computer Science-General Computer Science

CiteScore

6.10

自引率

5.30%

发文量

332

审稿时长

10 weeks

期刊介绍： PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.