Deep Pre-trained Contrastive Self-Supervised Learning: A Cyberbullying Detection Approach with Augmented Datasets

Lulwah M. Al-Harigy, H. Al-Nuaim, N. Moradpoor, Zhiyuan Tan
{"title":"Deep Pre-trained Contrastive Self-Supervised Learning: A Cyberbullying Detection Approach with Augmented Datasets","authors":"Lulwah M. Al-Harigy, H. Al-Nuaim, N. Moradpoor, Zhiyuan Tan","doi":"10.1109/CICN56167.2022.10008274","DOIUrl":null,"url":null,"abstract":"Cyberbullying is a widespread problem that has only increased in recent years due to the massive dependence on social media. Although, there are many approaches for detecting cyberbullying they still need to be improved upon for more accurate detection. We need new approaches that understand the context of the words used in cyberbullying by generating different representations of each word. In addition. there is a large amount of unlabelled data on the Internet that needs to be labelled for a more accurate detection process. Even though multiple methods for annotating datasets exists, the most widely used are still manual approaches, either using experts or crowdsourcing. However, The time needed and high cost of labor for manually annotation approaches result in a lack of annotated social network datasets for training a robust cyberbullying detector. Automated approaches can be relied upon in labelling data, such as using the Self-Supervised Learning (SSL) model. In this paper, we proposed two main parts. The first part is proposing a model of parallel BERT + Bi-LSTM used for detecting cyberbullying terms. The second part is utilizing Contrastive Self-Supervised Learning (a form of SSL) to augment the training set from unlabeled data using a small portion of another manually annotated dataset. Our proposed model that used deep pre-trained contrastive self-supervised learning for detecting cyberbullying using augmented datasets achieved a performance of (0.9311) using macro average F1 score. This result shows our model outperformed the baseline models - the top three teams in the competition SemEval-2020 Task 12.","PeriodicalId":287589,"journal":{"name":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CICN56167.2022.10008274","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Cyberbullying is a widespread problem that has only increased in recent years due to the massive dependence on social media. Although there are many approaches for detecting cyberbullying, they still need to be improved for more accurate detection. We need new approaches that understand the context of the words used in cyberbullying by generating different representations of each word. In addition, there is a large amount of unlabelled data on the Internet that needs to be labelled for a more accurate detection process. Even though multiple methods for annotating datasets exist, the most widely used are still manual approaches, relying on either experts or crowdsourcing. However, the time required and the high labor cost of manual annotation result in a lack of annotated social network datasets for training a robust cyberbullying detector. Automated approaches, such as Self-Supervised Learning (SSL) models, can instead be relied upon to label data. In this paper, we propose two main contributions. The first is a parallel BERT + Bi-LSTM model for detecting cyberbullying terms. The second is the use of Contrastive Self-Supervised Learning (a form of SSL) to augment the training set with unlabelled data, using only a small portion of another manually annotated dataset. Our proposed model, which uses deep pre-trained contrastive self-supervised learning to detect cyberbullying on augmented datasets, achieves a macro-average F1 score of 0.9311. This result shows that our model outperforms the baseline models, namely the top three teams in SemEval-2020 Task 12.
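Below is a minimal sketch of the detection component, assuming a PyTorch / Hugging Face Transformers implementation; the class name ParallelBertBiLSTM, the mean-pooling choice, and all hyper-parameters are illustrative assumptions rather than the authors' released code. It runs a Bi-LSTM branch over BERT's token representations and concatenates the [CLS] vector with the pooled Bi-LSTM output before classification.

import torch
import torch.nn as nn
from transformers import BertModel

class ParallelBertBiLSTM(nn.Module):
    def __init__(self, hidden_dim=128, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        emb_dim = self.bert.config.hidden_size  # 768 for bert-base
        # Bi-LSTM branch run over the BERT token representations.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Classifier over the concatenated [CLS] and pooled Bi-LSTM features.
        self.classifier = nn.Linear(emb_dim + 2 * hidden_dim, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]        # [CLS] token vector
        lstm_out, _ = self.bilstm(out.last_hidden_state)
        lstm_vec = lstm_out.mean(dim=1)              # mean-pooled Bi-LSTM output
        return self.classifier(torch.cat([cls_vec, lstm_vec], dim=-1))

For the augmentation part, one common formulation of a contrastive self-supervised objective is the NT-Xent (SimCLR-style) loss sketched below. The abstract does not specify the exact contrastive objective or text-augmentation strategy used, so this is only an assumed example in which z1 and z2 are encoder embeddings of two augmented views of the same post.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1[i] and z2[i] are embeddings of two augmented views of the same text.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                   # (2N, d)
    sim = z @ z.t() / temperature                    # cosine-similarity logits
    n = z1.size(0)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # The positive for row i is its other view at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)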