Contextual Reinforcement, Entity Delimitation and Generative Data Augmentation for Entity Recognition and Relation Extraction in Official Documents

J. Inf. Data Manag. Pub Date: 2023-10-31. DOI: 10.5753/jidm.2023.3180

F. Belém, Cláudio M. V. de Andrade, Celso França, Marcos Carvalho, M. Ganem, Gabriel Teixeira, Gabriel Jallais, Alberto H. F. Laender, Marcos André Gonçalves


Abstract

Transformer architectures have become the main component of various state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). As these architectures rely on semantic (contextual) aspects of word sequences, they may fail to accurately identify and delimit entity spans when there is little semantic context surrounding the named entities. This is the case for entities composed only of digits and punctuation, such as IDs and phone numbers, as well as for long compound names. In this article, we propose new techniques for contextual reinforcement and entity delimitation, based on pre- and post-processing, that provide a richer semantic context and improve SpERT, a state-of-the-art Span-based Entity and Relation Transformer. To provide further context to the training process of NER+RE, we propose a data augmentation technique based on Generative Pretrained Transformers (GPT). We evaluate our strategies using real data from public administration documents (official gazettes and biddings) and court lawsuits. Our results show that our pre- and post-processing strategies, when used jointly, allow significant improvements in NER+RE effectiveness, and they also show the benefits of using GPT for training data augmentation.
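
The abstract does not spell out how the pre- and post-processing steps work, so the following is only an illustrative sketch of one way entity delimitation could be approached, not the authors' implementation. It assumes that digit-and-punctuation spans are wrapped in marker tokens before the text reaches the model and that the markers are removed afterwards. The marker strings `[NUM]` and `[/NUM]`, the regular expression, and the function names `add_delimiters` and `strip_delimiters` are all hypothetical.

```python
import re

# Hypothetical marker tokens; the abstract does not name any.
OPEN, CLOSE = "[NUM]", "[/NUM]"

# Assumed pattern: starts and ends with a digit, and contains only digits,
# dots, hyphens and slashes in between (at least 5 characters in total).
NUMERIC_SPAN = re.compile(r"\d[\d.\-/]{3,}\d")


def add_delimiters(text: str) -> str:
    """Pre-processing: wrap ID-like spans in explicit marker tokens."""
    return NUMERIC_SPAN.sub(lambda m: f"{OPEN} {m.group(0)} {CLOSE}", text)


def strip_delimiters(text: str) -> str:
    """Post-processing: remove the markers to recover the original text."""
    return text.replace(f"{OPEN} ", "").replace(f" {CLOSE}", "")


if __name__ == "__main__":
    sample = "Contract 0042/2023 signed by CPF 123.456.789-09."
    marked = add_delimiters(sample)
    print(marked)
    print(strip_delimiters(marked))
```

Under these assumptions, the markers give a span-based model an explicit textual cue around entities that otherwise carry little semantic context, and the round trip through `strip_delimiters` leaves the original text unchanged.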
