Annotating Entities with Fine-Grained Types in Austrian Court Decisions

International Conference on Semantic Systems. Pub Date: 2021-08-31. DOI: 10.3233/ssw210041

Artem Revenko, Anna Breit, V. Mireles, J. Moreno-Schneider, C. Sageder, Sotirios Karampatakis


Citation count: 1

Abstract

The use of Named Entity Recognition tools on domain-specific corpora is often hampered by insufficient training data. We investigate an approach to produce fine-grained named entity annotations of a large corpus of Austrian court decisions from a small manually annotated training data set. We apply a general-purpose Named Entity Recognition model to produce annotations of common coarse-grained types. Next, a small sample of these annotations is manually inspected by domain experts to produce an initial fine-grained training data set. To use the small manually annotated data set efficiently, we formulate the task of named entity typing as a binary classification task: for each originally annotated occurrence of an entity and for each fine-grained type, we verify whether the entity belongs to that type. For this purpose we train a transformer-based classifier. We randomly sample 547 predictions and evaluate them manually. The incorrect predictions are used to improve the performance of the classifier: the corrected annotations are added to the training set. The experiments show that re-training with even a very small number (5 or 10) of originally incorrect predictions can significantly improve the classifier performance. We finally train the classifier on all available data and re-annotate the whole data set.
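The binary formulation described in the abstract can be illustrated with a minimal sketch. It only shows how each annotated mention might be expanded into one yes/no example per candidate fine-grained type; it is not the authors' implementation. The class names, the `to_binary_examples` helper, and the three fine-grained type labels are all hypothetical, chosen for illustration, and the transformer-based classifier itself is left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    """One occurrence of an entity found by the coarse-grained NER model."""
    text: str
    context: str
    coarse_type: str


@dataclass(frozen=True)
class BinaryExample:
    """A single yes/no question: does this mention belong to this fine type?"""
    mention: Mention
    fine_type: str
    label: bool


# Hypothetical fine-grained type inventory (not taken from the paper).
FINE_TYPES = ["court", "law_firm", "insurance_company"]


def to_binary_examples(mentions, gold):
    """Expand every (mention, fine type) pair into one binary example.

    `gold` maps a Mention to the set of fine-grained types assigned by an
    annotator; a mention missing from `gold` yields only negative examples.
    """
    return [
        BinaryExample(m, t, t in gold.get(m, set()))
        for m, t in product(mentions, FINE_TYPES)
    ]
```

Under these assumptions, a single manually annotated mention yields as many training examples as there are fine-grained types, which is one way a small annotated sample can be stretched. Corrected predictions from a manual evaluation round could be appended to the same list before re-training.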


