Annotation Curricula to Implicitly Train Non-Expert Annotators

IF 5.3 | Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics Pub Date : 2021-06-04 DOI:10.1162/coli_a_00436

Ji-Ung Lee, Jan-Christoph Klie, Iryna Gurevych

Computational Linguistics, vol. 48, no. 1, pp. 343-373. https://doi.org/10.1162/coli_a_00436

Citations: 7

Abstract

Annotation studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain. This can be overwhelming at the beginning, mentally taxing, and can introduce errors into the resulting annotations, especially in citizen science or crowdsourcing scenarios where domain expertise is not required. To alleviate these issues, this work proposes annotation curricula, a novel approach to implicitly train annotators. The goal is to gradually introduce annotators to the task by ordering the instances to be annotated according to a learning curriculum. To do so, this work formalizes annotation curricula for sentence- and paragraph-level annotation tasks, defines an ordering strategy, and identifies well-performing heuristics and interactively trained models on three existing English datasets. Finally, we provide a proof of concept for annotation curricula in a carefully designed user study with 40 voluntary participants who are asked to identify the most fitting misconception for English tweets about the Covid-19 pandemic. The results indicate that using a simple heuristic to order instances can already significantly reduce the total annotation time while preserving high annotation quality. Annotation curricula can thus be a promising research direction for improving data collection. To facilitate future research (for instance, adapting annotation curricula to specific tasks and expert annotation scenarios), all code and data from the user study, consisting of 2,400 annotations, are made available.
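The core idea of ordering instances with a simple heuristic can be illustrated with a minimal sketch. The abstract does not say which heuristics were evaluated, so the difficulty proxy used here (whitespace token count) and the names `token_count` and `order_by_curriculum` are assumptions made purely for illustration, not the authors' implementation.

```python
from typing import Callable, List


def token_count(text: str) -> int:
    """Hypothetical difficulty proxy (an assumption, not taken from the paper):
    the number of whitespace-separated tokens in the instance."""
    return len(text.split())


def order_by_curriculum(
    instances: List[str],
    difficulty: Callable[[str], float] = token_count,
) -> List[str]:
    """Return the instances sorted from estimated easiest to estimated hardest.

    sorted() is stable, so instances with equal scores keep their input order.
    """
    return sorted(instances, key=difficulty)


# Made-up example texts; not data from the user study.
tweets = [
    "Some long example text that takes a while to read and judge carefully",
    "Short one",
    "A medium length example text",
]
curriculum = order_by_curriculum(tweets)
for position, text in enumerate(curriculum, start=1):
    print(position, text)
```

Any other scoring function with the same signature could be passed as `difficulty`. Annotators would then see the instances in the returned order, so the ones estimated to be easier come first.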




Source journal: Computational Linguistics (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 0.00%

Review time: >12 weeks

About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.