Expert-Annotated Dataset to Study Cyberbullying in Polish Language

IF 2.2 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS
Data Pub Date : 2023-12-20 DOI:10.3390/data9010001
Michal Ptaszynski, Agata Pieciukiewicz, Pawel Dybala, Paweł Skrzek, Kamil Soliwoda, Marcin Fortuna, Gniewosz Leliwa, Michal Wroczynski
{"title":"Expert-Annotated Dataset to Study Cyberbullying in Polish Language","authors":"Michal Ptaszynski, Agata Pieciukiewicz, Pawel Dybala, Paweł Skrzek, Kamil Soliwoda, Marcin Fortuna, Gniewosz Leliwa, Michal Wroczynski","doi":"10.3390/data9010001","DOIUrl":null,"url":null,"abstract":"We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have exhibited a significant surge both within the Polish Internet as well as globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a secondary round of annotations was carried out by a team of adept annotators with specialized long-term expertise in cyberbullying and hate speech annotations. This second phase was further overseen by an experienced annotator, acting as a super-annotator. In its initial application, the dataset was leveraged for the categorization of cyberbullying instances in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that segregates harmful and non-harmful messages and (2) a multi-class classification that distinguishes between two variations of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfying classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.","PeriodicalId":36824,"journal":{"name":"Data","volume":"35 4","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2023-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data","FirstCategoryId":"90","ListUrlMain":"https://doi.org/10.3390/data9010001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have exhibited a significant surge both within the Polish Internet as well as globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a secondary round of annotations was carried out by a team of adept annotators with specialized long-term expertise in cyberbullying and hate speech annotations. This second phase was further overseen by an experienced annotator, acting as a super-annotator. In its initial application, the dataset was leveraged for the categorization of cyberbullying instances in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that segregates harmful and non-harmful messages and (2) a multi-class classification that distinguishes between two variations of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfying classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems.
研究波兰语网络欺凌的专家注释数据集
我们介绍了首个从波兰互联网收集的有害和攻击性语言数据集。我们对该数据集进行了精心策划,以促进对网络欺凌和仇恨言论等有害网络现象的探索。该数据集采用两种方法进行系统收集和注释。首先,由两名熟练的非专业志愿者在网络欺凌和仇恨言论语言专家的指导下进行注释。为了提高注释的精确度,由长期从事网络欺凌和仇恨言论注释工作的专业注释员团队进行了第二轮注释。第二阶段由一名经验丰富的注释员作为超级注释员进一步监督。在最初的应用中,该数据集被用于对波兰语中的网络欺凌实例进行分类。具体来说,该数据集是两项不同任务的基础:(1) 区分有害信息和非有害信息的二元分类;(2) 区分有害内容(网络欺凌和仇恨言论)的两种变体以及非有害类别的多类分类。除了数据集本身,我们还提供了分类效果令人满意的模型。这些模型可供第三方用于构建网络欺凌预防系统。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Data
Data Decision Sciences-Information Systems and Management
CiteScore
4.30
自引率
3.80%
发文量
0
审稿时长
10 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信