Classification and identification level ambiguity in error annotation

Applied Corpus Linguistics, Pub Date: 2022-12-01, DOI: 10.1016/j.acorp.2022.100035

Alexandros Tantos, Nikolaos Amvrazis

Citations: 1

Abstract

The vast majority of corpus annotation projects go through a piloting phase in which the annotation scheme is gradually shaped through iterative annotation cycles until its final version is produced and applied to the collected data. The differences in annotators’ choices are usually recorded and reflected by the ‘Inter-annotator Agreement’ (IAA), which serves as a proxy for understanding and resolving the issues raised. However, little has been reported on how to formulate a systematic approach to (i) tracing the source of the differences in the annotators’ choices and (ii) providing attainable solutions that would considerably increase IAA. In this paper, the ‘Greek Learner Corpus II’ (GLCII), the largest online Greek learner corpus, serves as a basis for shedding light on two commonly encountered types of ambiguity in error annotation that are closely related to target languages in which syncretism is ubiquitous in grammar (e.g., Greek and Romanian): a classification-level and an identification-level ambiguity.
