Developing a Named Entity Recognition Dataset for Tagalog

arXiv (Cornell University). Pub Date: 2023-11-13. DOI: 10.48550/arxiv.2311.07161

Miranda, Lester James V.

Abstract

We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap for Philippine languages, where NER resources are scarce. The texts were obtained from a pretraining corpus containing news reports and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents annotated with three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81. We also conducted an extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP.
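As a rough illustration of the agreement metric mentioned above, Cohen's $\kappa$ compares the observed agreement between two annotators with the agreement expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it over per-token labels. The function name `cohen_kappa` and the toy label sequences are hypothetical and are not taken from the dataset or its processing code; how the paper aggregates agreement across annotators and documents is not stated in the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper for illustration only; not the paper's code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, multiply the two annotators'
    # marginal probabilities of using it, then sum over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example (made-up labels): the annotators disagree on one token out of eight.
annotator_1 = ["PER", "O", "ORG", "LOC", "O", "O", "PER", "LOC"]
annotator_2 = ["PER", "O", "ORG", "O", "O", "O", "PER", "LOC"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Because $\kappa$ discounts chance agreement, it is lower than raw percent agreement whenever one label (typically the non-entity tag) dominates, which is common in NER annotation.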
