Lulwah M. Al-Harigy, H. Al-Nuaim, N. Moradpoor, Zhiyuan Tan
{"title":"深度预训练对比自监督学习:基于增强数据集的网络欺凌检测方法","authors":"Lulwah M. Al-Harigy, H. Al-Nuaim, N. Moradpoor, Zhiyuan Tan","doi":"10.1109/CICN56167.2022.10008274","DOIUrl":null,"url":null,"abstract":"Cyberbullying is a widespread problem that has only increased in recent years due to the massive dependence on social media. Although, there are many approaches for detecting cyberbullying they still need to be improved upon for more accurate detection. We need new approaches that understand the context of the words used in cyberbullying by generating different representations of each word. In addition. there is a large amount of unlabelled data on the Internet that needs to be labelled for a more accurate detection process. Even though multiple methods for annotating datasets exists, the most widely used are still manual approaches, either using experts or crowdsourcing. However, The time needed and high cost of labor for manually annotation approaches result in a lack of annotated social network datasets for training a robust cyberbullying detector. Automated approaches can be relied upon in labelling data, such as using the Self-Supervised Learning (SSL) model. In this paper, we proposed two main parts. The first part is proposing a model of parallel BERT + Bi-LSTM used for detecting cyberbullying terms. The second part is utilizing Contrastive Self-Supervised Learning (a form of SSL) to augment the training set from unlabeled data using a small portion of another manually annotated dataset. Our proposed model that used deep pre-trained contrastive self-supervised learning for detecting cyberbullying using augmented datasets achieved a performance of (0.9311) using macro average F1 score. This result shows our model outperformed the baseline models - the top three teams in the competition SemEval-2020 Task 12.","PeriodicalId":287589,"journal":{"name":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep Pre-trained Contrastive Self-Supervised Learning: A Cyberbullying Detection Approach with Augmented Datasets\",\"authors\":\"Lulwah M. Al-Harigy, H. Al-Nuaim, N. Moradpoor, Zhiyuan Tan\",\"doi\":\"10.1109/CICN56167.2022.10008274\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cyberbullying is a widespread problem that has only increased in recent years due to the massive dependence on social media. Although, there are many approaches for detecting cyberbullying they still need to be improved upon for more accurate detection. We need new approaches that understand the context of the words used in cyberbullying by generating different representations of each word. In addition. there is a large amount of unlabelled data on the Internet that needs to be labelled for a more accurate detection process. Even though multiple methods for annotating datasets exists, the most widely used are still manual approaches, either using experts or crowdsourcing. However, The time needed and high cost of labor for manually annotation approaches result in a lack of annotated social network datasets for training a robust cyberbullying detector. Automated approaches can be relied upon in labelling data, such as using the Self-Supervised Learning (SSL) model. In this paper, we proposed two main parts. The first part is proposing a model of parallel BERT + Bi-LSTM used for detecting cyberbullying terms. The second part is utilizing Contrastive Self-Supervised Learning (a form of SSL) to augment the training set from unlabeled data using a small portion of another manually annotated dataset. Our proposed model that used deep pre-trained contrastive self-supervised learning for detecting cyberbullying using augmented datasets achieved a performance of (0.9311) using macro average F1 score. This result shows our model outperformed the baseline models - the top three teams in the competition SemEval-2020 Task 12.\",\"PeriodicalId\":287589,\"journal\":{\"name\":\"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CICN56167.2022.10008274\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CICN56167.2022.10008274","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deep Pre-trained Contrastive Self-Supervised Learning: A Cyberbullying Detection Approach with Augmented Datasets
Cyberbullying is a widespread problem that has only increased in recent years due to the massive dependence on social media. Although, there are many approaches for detecting cyberbullying they still need to be improved upon for more accurate detection. We need new approaches that understand the context of the words used in cyberbullying by generating different representations of each word. In addition. there is a large amount of unlabelled data on the Internet that needs to be labelled for a more accurate detection process. Even though multiple methods for annotating datasets exists, the most widely used are still manual approaches, either using experts or crowdsourcing. However, The time needed and high cost of labor for manually annotation approaches result in a lack of annotated social network datasets for training a robust cyberbullying detector. Automated approaches can be relied upon in labelling data, such as using the Self-Supervised Learning (SSL) model. In this paper, we proposed two main parts. The first part is proposing a model of parallel BERT + Bi-LSTM used for detecting cyberbullying terms. The second part is utilizing Contrastive Self-Supervised Learning (a form of SSL) to augment the training set from unlabeled data using a small portion of another manually annotated dataset. Our proposed model that used deep pre-trained contrastive self-supervised learning for detecting cyberbullying using augmented datasets achieved a performance of (0.9311) using macro average F1 score. This result shows our model outperformed the baseline models - the top three teams in the competition SemEval-2020 Task 12.