Annotation scheme and its evaluation (Anotacijska shema i njezina evaluacija)

Rasprave (LANGUAGE & LINGUISTICS, IF 0.2). Pub Date: 2023-01-01. DOI: 10.31724/rihjj.49.1.8

Barbara Lewandowska-Tomaszczyk, Slavko Žitnik, Olga Dontcheva-Navratilova, Agnieszka Borowiak, Kristina Despot, Jelena Mitrović, Chaya Liebeskind, Giedre Valunaite Oleskevicienė, Anna Bączkowska, Paul A. Wilson, Marcin Trojszczak, Ivana Brač, Lobel Filipić, Ana Ostroški Anić



Abstract

This paper presents and discusses aspects of the linguistic annotation of OFFENSIVE LANGUAGE, including the creation, annotation practice, curation, and evaluation of an OFFENSIVE LANGUAGE annotation taxonomy scheme that was first proposed in Lewandowska-Tomaszczyk et al. (2021). An extended offensive language ontology comprising 17 categories, structured in 4 hierarchical levels, has been shown to represent the encoding of the defined offensive language schema, trained using non-contextual word embeddings (i.e., Word2Vec and FastText), and eventually juxtaposed with the data acquired from a pairwise training and testing analysis of existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al., submitted). The study reports on the annotation practice in WG 4.1.1, Incivility in media and social media, in the context of COST Action CA 18209, European network for Web-centred linguistic data science (Nexus Linguarum), carried out with the INCEpTION tool (https://github.com/inception-project/inception), a semantic annotation platform offering assistance in annotation. The results partly support the proposed ontology of explicit offense and positive implicitness types to provide more variance among widely recognized types of figurative language (e.g., metaphorical, metonymic, ironic). The use of the annotation system and the representation of linguistic data were also evaluated through a series of annotators' comments, collected by means of a questionnaire and an open discussion. The annotation results and the questionnaire showed that inter-annotator agreement was low or medium for some of the categories, and that annotators found it more challenging to distinguish between category items than between aspect items, with the category items offensive, insulting and abusive being the most difficult in this respect. On the basis of these results, the need for taxonomic simplification measures has been recognized for further annotation practices.
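To make the notion of low or medium inter-annotator agreement concrete, the following is a minimal sketch of Cohen's kappa for two annotators. The abstract does not state which agreement measure the study used, so the choice of kappa is an assumption, and the function name `cohen_kappa` and the toy labels are purely illustrative and not taken from the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative only: the agreement measure used in the study is not
    specified in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations using the three category items that the
# abstract reports as hardest to tell apart.
annotator_1 = ["offensive", "insulting", "abusive", "offensive", "insulting", "abusive"]
annotator_2 = ["offensive", "offensive", "abusive", "insulting", "insulting", "offensive"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on half of the items, but a third of that agreement is expected by chance alone, so kappa falls well below the raw agreement rate. Values in this range are commonly read as low agreement, which is the kind of outcome that motivates merging or simplifying closely related categories.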
