REDFM: a Filtered and Multilingual Relation Extraction Dataset

Annual Meeting of the Association for Computational Linguistics | Pub Date: 2023-06-16 | DOI: 10.48550/arXiv.2306.09802

Pere-Lluís Huguet Cabot, Simone Tedeschi, A. N. Ngomo, Roberto Navigli

Abstract

Relation Extraction (RE) is the task of identifying relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address this issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SREDFM, an automatically annotated dataset covering 18 languages, 400 relation types, and 13 entity types, totaling more than 40 million triplet instances. Second, we propose REDFM, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, which extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at [https://www.github.com/babelscape/rebel](https://www.github.com/babelscape/rebel).
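
To make the notion of a typed triplet concrete, the following is a minimal Python sketch of how a subject, relation, and object with entity types might be represented and compared against a reference set. Everything in it is an illustrative assumption rather than the released format: the class names `Entity` and `Triplet`, the field names, the example type and relation labels, and the exact-match micro F1 in `triplet_f1` are hypothetical and are not taken from the datasets or from mREBEL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical container for an entity mention and its type."""
    surface: str      # text span as it appears in the sentence
    entity_type: str  # type label (illustrative, not the dataset's label set)


@dataclass(frozen=True)
class Triplet:
    """Hypothetical (subject, relation, object) triplet with typed entities."""
    subject: Entity
    relation: str
    obj: Entity


def triplet_f1(predicted, gold):
    """Exact-match micro F1 between two collections of triplets.

    Assumption: a prediction counts as correct only if the subject,
    relation, object, and both entity types all match a gold triplet.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only; the labels below are made up for this sketch.
rome = Entity("Rome", "LOC")
italy = Entity("Italy", "LOC")
dante = Entity("Dante", "PER")

gold = [
    Triplet(rome, "capital of", italy),
    Triplet(dante, "country of citizenship", italy),
]
predicted = [
    Triplet(rome, "capital of", italy),
    Triplet(dante, "place of birth", italy),  # wrong relation label
]

score = triplet_f1(predicted, gold)
print(f"triplet F1: {score:.2f}")
```

Frozen dataclasses are hashable, so triplets can be compared as sets. A stricter or looser matching criterion (for example, ignoring entity types) would only change what is stored in each `Triplet`.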
