A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks

IF 4.5 | Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Artificial Intelligence Research | Pub Date: 2023-12-11 | DOI: 10.1613/jair.1.14388

Alexander Braylan, Madalyn Marabella, Omar Alonso, Matthew Lease


Citations: 0

Abstract

Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. To date, many aggregation models have been proposed for simple categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks, such as those involving open-ended, multivariate, or structured responses. Similarly, while a variety of bespoke models have been proposed for specific tasks, our work is the first we are aware of to introduce aggregation methods that generalize across many, diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by applying readily available task-specific distance functions, then devising a task-agnostic method to model these distances between labels, rather than the labels themselves. This article presents a unified treatment of our prior work on complex annotation modeling and extends that work with investigation of three new research questions. First, how do complex annotation task and dataset properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices in order to maximize aggregation accuracy? Finally, what tests and diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct large-scale simulation studies and broad experiments on real, complex datasets. Regarding testing, we introduce the concept of unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. 
Beyond investigating the research questions above, we discuss the foundational concept and nature of annotation complexity, present a new aggregation model as a conceptual bridge between traditional models and our own, and contribute a new general semi-supervised learning method for complex label aggregation that outperforms prior work.
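The abstract's central idea, modeling distances between labels rather than the labels themselves, can be illustrated with a very simple baseline. The sketch below is not the authors' model: it only picks, for a single item, the annotation with the smallest mean distance to the other annotations, using a task-specific distance function supplied by the caller. The names `iou_distance`, `aggregate_by_distance`, and `passes_unanimity_test` are hypothetical and chosen for illustration; the last one shows the flavor of a unit test for an aggregation method, under the assumption that a sensible aggregator should return the shared label when all annotators agree.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou_distance(a: Box, b: Box) -> float:
    """Task-specific distance for bounding boxes: 1 - intersection over union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes are treated as identical.
    return 1.0 - inter / union if union > 0 else 0.0


def aggregate_by_distance(labels: Sequence[T],
                          distance: Callable[[T, T], float]) -> T:
    """Task-agnostic step: return the label closest, on average, to the others.

    Illustrative baseline only; it sees labels solely through `distance`.
    """
    if not labels:
        raise ValueError("at least one label is required")
    if len(labels) == 1:
        return labels[0]

    def mean_distance(i: int) -> float:
        others = [distance(labels[i], labels[j])
                  for j in range(len(labels)) if j != i]
        return sum(others) / len(others)

    best_index = min(range(len(labels)), key=mean_distance)
    return labels[best_index]


def passes_unanimity_test(label: T, n_annotators: int,
                          distance: Callable[[T, T], float]) -> bool:
    """Unit-test-style check: unanimous annotators should yield their label."""
    return aggregate_by_distance([label] * n_annotators, distance) == label


if __name__ == "__main__":
    # Three roughly agreeing boxes and one outlier for the same object.
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60)]
    print(aggregate_by_distance(boxes, iou_distance))
    print(passes_unanimity_test((0, 0, 10, 10), 3, iou_distance))
```

Because the aggregation step only calls `distance`, the same function could be reused for other complex tasks by swapping in a different distance (for example, an edit distance for sequences), which is the kind of generality the abstract describes.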




Source journal

Journal of Artificial Intelligence Research (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

9.60

Self-citation rate

4.00%

Articles published

Review time

4 months

About the journal: JAIR (ISSN 1076-9757) covers all areas of artificial intelligence (AI), publishing refereed research articles, survey articles, and technical notes. Established in 1993 as one of the first electronic scientific journals, JAIR is indexed by INSPEC, Science Citation Index, and MathSciNet. JAIR reviews papers within approximately three months of submission and publishes accepted articles on the internet immediately upon receiving the final versions. JAIR articles are published for free distribution on the internet by the AI Access Foundation, and for purchase in bound volumes by AAAI Press.