Expert-Annotated Dataset to Study Cyberbullying in Polish Language

Impact factor: 2.0 · JCR Q3 (Computer Science, Information Systems)

Journal: Data · Publication date: 2023-12-20 · DOI: 10.3390/data9010001

Michal Ptaszynski, Agata Pieciukiewicz, Pawel Dybala, Paweł Skrzek, Kamil Soliwoda, Marcin Fortuna, Gniewosz Leliwa, Michal Wroczynski



Abstract

We introduce the first dataset of harmful and offensive language collected from the Polish Internet. The dataset was curated to support the study of harmful online phenomena such as cyberbullying and hate speech, which have surged both on the Polish Internet and globally. The dataset was systematically collected and then annotated in two rounds. First, it was annotated by two proficient layperson volunteers working under the guidance of a specialist in the language of cyberbullying and hate speech. To improve the precision of the annotations, a second round was carried out by a team of annotators with specialized long-term expertise in cyberbullying and hate speech annotation; this round was overseen by an experienced annotator acting as a super-annotator. In its initial application, the dataset was used to classify cyberbullying in the Polish language. Specifically, it serves as the basis for two tasks: (1) a binary classification that separates harmful from non-harmful messages and (2) a multi-class classification that distinguishes two types of harmful content (cyberbullying and hate speech) from a non-harmful category. Alongside the dataset, we also provide the models that showed satisfactory classification performance. These models are available for third-party use in building cyberbullying prevention systems.
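The relationship between the two tasks can be illustrated with a minimal sketch, assuming that each message carries one of three labels and that the binary task simply merges the two harmful classes. The label strings and helper names below are hypothetical and are not taken from the published dataset, whose actual encoding may differ.

```python
from collections import Counter

# Hypothetical label names; the dataset's real label encoding may differ.
NON_HARMFUL = "non-harmful"
CYBERBULLYING = "cyberbullying"
HATE_SPEECH = "hate-speech"
MULTICLASS_LABELS = (NON_HARMFUL, CYBERBULLYING, HATE_SPEECH)


def to_binary(label: str) -> str:
    """Collapse a three-class label into the harmful / non-harmful task."""
    if label not in MULTICLASS_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return NON_HARMFUL if label == NON_HARMFUL else "harmful"


def label_distribution(labels: list[str]) -> Counter:
    """Count how often each label occurs, e.g. to inspect class balance."""
    return Counter(labels)


# Toy labels for illustration only; these are not real annotations.
toy_multiclass = [NON_HARMFUL, CYBERBULLYING, NON_HARMFUL, HATE_SPEECH]
toy_binary = [to_binary(label) for label in toy_multiclass]

print(label_distribution(toy_multiclass))
print(label_distribution(toy_binary))
```

Under this assumption, any multi-class annotation yields a binary one for free, so a single annotation pass can serve both tasks.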
