Named Entity Recognition Approaches Applied to Legal Document Segmentation

F. X. B. da Silva, G. M. C. Guimarães, R. Marcacini, A. L. Queiroz, V. R. P. Borges, T. P. Faleiros, L. P. F. Garcia
{"title":"Named Entity Recognition Approaches Applied to Legal Document Segmentation","authors":"F. X. B. da Silva, G. M. C. Guimarães, R. Marcacini, A. L. Queiroz, V. R. P. Borges, T. P. Faleiros, L. P. F. Garcia","doi":"10.5753/kdmile.2022.227949","DOIUrl":null,"url":null,"abstract":"Document Segmentation is a method of dividing a document into smaller parts, known as segments, which share similarities that allow machines to distinguish between them. It might be useful to classify these segments, making it a problem with two steps: (I) the extraction of the segments; and (II) the annotation of these segments. The Named Entity Recognition problem's goal is to identify and classify entities within a text, having also to deal with those two questions: extraction and classification. In this study, we tackle the problem of Document Segmentation and the annotation of these segments through NER approaches, using CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models. The study is focused on Brazilian legal documents, proposing a data set of 127 annotated Portuguese texts from the Official Gazette of the Federal District, published between 2001 and 2015. The experiments were made using word-based and sentence-based models, with CRF sentence-based model showing the best results.","PeriodicalId":417100,"journal":{"name":"Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/kdmile.2022.227949","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Document Segmentation is a method of dividing a document into smaller parts, known as segments, which share similarities that allow machines to distinguish between them. It might be useful to classify these segments, making it a problem with two steps: (I) the extraction of the segments; and (II) the annotation of these segments. The Named Entity Recognition problem's goal is to identify and classify entities within a text, having also to deal with those two questions: extraction and classification. In this study, we tackle the problem of Document Segmentation and the annotation of these segments through NER approaches, using CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models. The study is focused on Brazilian legal documents, proposing a data set of 127 annotated Portuguese texts from the Official Gazette of the Federal District, published between 2001 and 2015. The experiments were made using word-based and sentence-based models, with CRF sentence-based model showing the best results.
命名实体识别方法在法律文件分割中的应用
文档分割是一种将文档分成更小的部分的方法,称为段,这些部分具有相似性,使机器能够区分它们。对这些片段进行分类可能是有用的,使其成为两个步骤的问题:(I)提取片段;(二)对这些片段的注释。命名实体识别问题的目标是识别和分类文本中的实体,还必须处理这两个问题:提取和分类。在本研究中,我们使用CRF、CNN-CNN-LSTM和CNN-biLSTM-CRF模型,通过NER方法解决了文档分割和这些片段的标注问题。这项研究的重点是巴西的法律文件,提出了一套数据集,其中有127个带注释的葡萄牙语文本,来自2001年至2015年出版的联邦区官方公报。使用基于单词和基于句子的模型分别进行了实验,其中基于句子的CRF模型效果最好。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信