Named Entity Recognition Algorithms Comparison For Judicial Text Data

2020 IEEE 14th International Conference on Application of Information and Communication Technologies (AICT). Pub Date: 2020-10-07. DOI: 10.1109/AICT50176.2020.9368843

Kuralbayev Aibek, Mukhsimbayev Bobur, Bekbaganbetov Abay, Fuad Hajiyev


Citations: 3

Abstract


The more developed a society, the greater the role of legal relations. Accordingly, the number of court appeals from both individuals and legal entities is growing rapidly. Two tasks therefore become extremely important in any society:

1) Reducing the time spent on the legal process, including reducing "errors" at various levels.
2) Reducing the number of appeals to the courts and increasing the role of mediation.

The authors have developed a prototype of a recommender system, the "Smart Judge Assistant" (SJA), which largely solves both tasks. The prototype has already successfully passed the first stage of testing by the Supreme Court of the Republic of Kazakhstan. While developing the prototype, the authors faced various problems related to text recognition. One of them is the problem of data publicity.

Objective: One of the main tasks in making documents public is to hide the personal data of the parties. In this article we compare several Named Entity Recognition (NER) models for extracting personal information, such as person names, organization names and location names, from judicial acts in Russian and Kazakh.

Methodology: Four types of algorithms were chosen for training NER models: CRF (Conditional Random Fields), LSTM (Long Short-Term Memory) with character embeddings, LSTM-CRF, and BERT (Bidirectional Encoder Representations from Transformers).

Findings: Models trained with all four algorithms achieve reasonably high accuracy because the source documents in the judicial dataset have almost identical structure. BERT shows the best performance of the four (F1 score: 0.87).
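As an illustration of the task described above, the following is a minimal Python sketch of how the output of any NER tagger could be used to hide personal data and how an F1 score could be computed. It is not the authors' implementation. The BIO tagging scheme, the `PER`/`ORG`/`LOC` label names, the function names, and the sample sentence are all assumptions made for the example, and the abstract does not state whether the reported F1 is entity-level (as here) or token-level.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def redact(tokens: List[str], tags: List[str]) -> str:
    """Replace every tagged entity with a placeholder such as [PER]."""
    out: List[str] = []
    prev_end = 0
    for start, end, label in bio_to_spans(tags):
        out.extend(tokens[prev_end:start])
        out.append(f"[{label}]")
        prev_end = end
    out.extend(tokens[prev_end:])
    return " ".join(out)


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a predicted span counts only on an exact match."""
    gold_spans = set(bio_to_spans(gold))
    pred_spans = set(bio_to_spans(pred))
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Invented sample sentence; the tags stand in for a model's predictions.
tokens = ["Judge", "Ivan", "Petrov", "heard", "the", "case", "in", "Almaty"]
gold_tags = ["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC"]
pred_tags = ["O", "B-PER", "I-PER", "O", "O", "O", "O", "O"]

redacted = redact(tokens, gold_tags)
score = entity_f1(gold_tags, pred_tags)
print(redacted)
print(round(score, 2))
```

In this sketch a model that misses the location keeps perfect precision but loses recall, which is why a missed entity lowers F1 and, in the anonymization setting, leaves personal data visible in the published document.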
