Named Entity Recognition Approaches Applied to Legal Document Segmentation

Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022). Pub Date: 2022-11-28. DOI: 10.5753/kdmile.2022.227949

F. X. B. da Silva, G. M. C. Guimarães, R. Marcacini, A. L. Queiroz, V. R. P. Borges, T. P. Faleiros, L. P. F. Garcia



Abstract

Document Segmentation divides a document into smaller parts, known as segments, whose internal similarities allow machines to distinguish between them. It is often useful to also classify these segments, which makes the task a two-step problem: (I) extracting the segments; and (II) annotating them. Named Entity Recognition (NER) aims to identify and classify entities within a text, and so faces the same two questions: extraction and classification. In this study, we tackle Document Segmentation and the annotation of the resulting segments through NER approaches, using CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models. The study focuses on Brazilian legal documents and proposes a data set of 127 annotated Portuguese texts from the Official Gazette of the Federal District, published between 2001 and 2015. The experiments were conducted using word-based and sentence-based models, with the sentence-based CRF model showing the best results.
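To illustrate how segmentation can be cast as an NER-style sequence-labeling task, the following is a minimal sketch that assumes a sentence-based setup in which each sentence receives one BIO tag. The tag names (`ACT`, `CONTRACT`) and the helper `bio_to_segments` are hypothetical and are not taken from the paper; the sketch only shows how predicted tags could be grouped back into labeled segments, not how the authors' models produce them.

```python
from typing import List, Tuple


def bio_to_segments(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group per-sentence BIO tags into (start, end_exclusive, label) segments.

    Assumption: tags look like "B-LABEL", "I-LABEL" or "O". An "I-" tag that
    does not continue the current segment is treated leniently as a new start.
    """
    segments: List[Tuple[int, int, str]] = []
    start, label = None, None  # currently open segment, if any
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            # Close the open segment before handling the new tag.
            if label is not None:
                segments.append((start, i, label))
            if prefix in ("B", "I"):
                start, label = i, name  # open a new segment
            else:
                start, label = None, None  # "O": outside any segment
    # Close a segment that runs to the end of the document.
    if label is not None:
        segments.append((start, len(tags), label))
    return segments


# Hypothetical tags for a six-sentence document.
example_tags = ["B-ACT", "I-ACT", "I-ACT", "O", "B-CONTRACT", "I-CONTRACT"]
example_segments = bio_to_segments(example_tags)
print(example_segments)
```

Under these assumptions, extraction (step I) corresponds to recovering the segment boundaries and annotation (step II) to reading off each segment's label, so a single tagger addresses both steps at once.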


