Named Entity Recognition Approaches Applied to Legal Document Segmentation

Anais do X Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2022). Published: 2022-11-28. DOI: 10.5753/kdmile.2022.227949

F. X. B. da Silva, G. M. C. Guimarães, R. Marcacini, A. L. Queiroz, V. R. P. Borges, T. P. Faleiros, L. P. F. Garcia

Citations: 0

Abstract

Document segmentation divides a document into smaller parts, called segments, whose internal similarities allow a machine to tell them apart. Classifying the segments is often useful as well, which makes the task a two-step problem: (I) extracting the segments and (II) annotating them. Named Entity Recognition (NER) aims to identify and classify entities within a text, so it faces the same two subproblems: extraction and classification. In this study, we tackle document segmentation and segment annotation with NER approaches, using CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models. The study focuses on Brazilian legal documents and proposes a data set of 127 annotated Portuguese texts from the Official Gazette of the Federal District, published between 2001 and 2015. Experiments were run with word-based and sentence-based models, and the sentence-based CRF model showed the best results.
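
To make the link between NER and segmentation concrete, the sketch below shows one common way a sequence labeller's output can be turned into typed segments. It assumes a BIO tagging scheme applied per sentence, which the abstract does not state; the label names (`HEADER`, `ACT`), the toy sentences, and the `decode_bio` helper are all hypothetical and are not taken from the paper or its data set.

```python
from typing import List, Optional, Tuple

# (label, start index, end index exclusive) over the tagged units
Segment = Tuple[str, int, int]


def decode_bio(tags: List[str]) -> List[Segment]:
    """Collapse per-unit BIO tags into (label, start, end) segments.

    Assumption: tags look like "B-X", "I-X" or "O". The units may be
    words or sentences; the decoding is the same in both cases.
    """
    segments: List[Segment] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # An I- tag continues the open segment only if the label matches.
        if prefix == "I" and name == label:
            continue
        # Anything else closes the open segment, if there is one.
        if label is not None:
            segments.append((label, start, i))
            label = None
        # B- opens a new segment; a stray I- is treated as an opening tag.
        if prefix in ("B", "I") and name:
            label, start = name, i
    if label is not None:
        segments.append((label, start, len(tags)))
    return segments


# Toy input: five made-up sentences with hypothetical predicted tags.
sentences = [
    "Section heading.",
    "First act, opening sentence.",
    "First act, closing sentence.",
    "Unrelated notice.",
    "Second act, single sentence.",
]
tags = ["B-HEADER", "B-ACT", "I-ACT", "O", "B-ACT"]

segments = decode_bio(tags)
for seg_label, seg_start, seg_end in segments:
    print(seg_label, "|", " ".join(sentences[seg_start:seg_end]))
```

Step (I), extraction, corresponds to the segment boundaries recovered from the `B-`/`I-` prefixes, and step (II), annotation, to the label carried by each tag. In a sentence-based setup like the one the abstract reports as strongest, a model such as a CRF would predict one tag per sentence, and a decoder of this kind would run on its output.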
