COLD: Annotation scheme and evaluation data set for complex offensive language in English

Alexis Palmer, Christine Carr, Melissa Robinson, Jordan Sanders
Journal: Journal for Language Technology and Computational Linguistics
Publication date: 2020-07-01
Publication type: Journal Article
DOI: 10.21248/jlcl.34.2020.222 (https://doi.org/10.21248/jlcl.34.2020.222)
Citations: 13

Abstract

This paper presents a new, extensible annotation scheme for offensive language data sets. The annotation scheme expands coverage beyond fairly straightforward cases of offensive language to address several cases of complex, implicit, and/or pragmatically-triggered offensive language. We apply the annotation scheme to create a new Complex Offensive Language Data Set for English (COLD-EN). The primary purpose of this data set is to diagnose how well systems for automatic detection of abusive language are able to classify three types of complex offensive language: reclaimed slurs, offensive utterances containing pejorative adjectival nominalizations (and no slur terms), and utterances conveying offense through linguistic distancing. COLD offers a straightforward framework for error analysis. Our vision is that researchers will use this data set to diagnose the strengths and weaknesses of their offensive language detection systems. In this paper, we diagnose some strengths and weaknesses of a top-performing offensive language detection system by: a) using it to classify COLD, and b) investigating its performance on the 10 fine-grained categories supported by our annotation scheme. We evaluate the system’s performance when trained on five different standard data sets for offensive language detection. Systems trained on different data sets have different strengths and weaknesses, with most performing poorly on the phenomena of reclaimed slurs and pejorative nominalizations. NOTE: This paper contains sensitive and offensive material. The offensive materials are part of a complex puzzle we wish to better understand; they appear in the form of lightly-censored slurs and degrading insults. We do not condone this type of language, nor does it reflect the attitudes or beliefs of the authors.
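
The kind of diagnostic described in the abstract, scoring a classifier separately on each fine-grained category, might be sketched roughly as follows. This is only an illustration under stated assumptions: the function `per_category_accuracy`, the toy examples, the placeholder category names, and the trivial `always_offensive` classifier are all hypothetical and are not taken from the paper or from the COLD data set.

```python
from collections import defaultdict


def per_category_accuracy(examples, predict):
    """Compute accuracy separately for each fine-grained category.

    examples: iterable of (text, gold_label, category) triples.
    predict:  callable mapping a text to a predicted label.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, gold, category in examples:
        total[category] += 1
        if predict(text) == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy data: the category names are placeholders,
# not the labels used by the annotation scheme.
toy_examples = [
    ("example a", "offensive", "category_1"),
    ("example b", "not_offensive", "category_1"),
    ("example c", "offensive", "category_2"),
    ("example d", "not_offensive", "category_2"),
]


def always_offensive(text):
    """Stand-in classifier that labels every input as offensive."""
    return "offensive"


scores = per_category_accuracy(toy_examples, always_offensive)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2f}")
```

Breaking results down by category rather than reporting a single aggregate score is what lets weaknesses on specific phenomena show up; a real evaluation would substitute the actual data and a trained system for the placeholders above.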