“Zo Grof !”: A Comprehensive Corpus for Offensive and Abusive Language in Dutch

Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH) Pub Date : 1900-01-01 DOI:10.18653/v1/2022.woah-1.5

Ward Ruitenbeek, Victor Zwart, Robin Van Der Noord, Zhenja Gnezdilov, T. Caselli

Citations: 3

Abstract

This paper presents a comprehensive corpus for the study of socially unacceptable language in Dutch. The corpus extends and revises an existing resource with more data and introduces a new annotation dimension for offensive language, making it a unique resource in the Dutch language landscape. Each language phenomenon (abusive and offensive language) in the corpus has been annotated with a multi-layer annotation scheme modelling the explicitness and the target(s) of the message. We have conducted a new set of experiments with different classification algorithms on all annotation dimensions. Monolingual Pre-Trained Language Models prove to be the best systems, obtaining a macro-average F1 of 0.828 for binary classification of offensive language, and 0.579 for the targets of offensive messages. Furthermore, the best system obtains a macro-average F1 of 0.667 for distinguishing between abusive and offensive messages.
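The scores above are macro-averaged F1 values, that is, the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The following is a minimal sketch of that computation in plain Python; the function name `macro_f1`, the label names `OFF`/`NOT`, and the toy data are assumptions made for illustration and are not taken from the paper or its corpus.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (not real corpus data)
gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(gold, pred)
print(round(score, 3))  # 0.625: mean of F1(OFF) = 0.5 and F1(NOT) = 0.75
```

Because each class contributes equally, the minority class in this toy example pulls the macro average down even though most majority-class items are classified correctly.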

