Kuralbayev Aibek, Mukhsimbayev Bobur, Bekbaganbetov Abay, Fuad Hajiyev
2020 IEEE 14th International Conference on Application of Information and Communication Technologies (AICT), published 2020-10-07. DOI: 10.1109/AICT50176.2020.9368843 (https://doi.org/10.1109/AICT50176.2020.9368843)
Named Entity Recognition Algorithms Comparison For Judicial Text Data
The more developed a society, the greater the role of legal relations. Accordingly, the number of court appeals from both individuals and legal entities is growing rapidly. Two tasks therefore become extremely important in any society:
1) Reducing the time spent on the legal process, including reducing "errors" at various levels.
2) Reducing the number of appeals to the courts and increasing the role of mediation.
The authors have developed a prototype of a recommender system, the "Smart Judge Assistant" (SJA), which largely solves both tasks. The prototype has already passed the first stage of testing by the Supreme Court of the Republic of Kazakhstan. While developing the prototype, the authors faced various problems related to text recognition, one of which is the problem of making data public.
Objective of the article: One of the main tasks in making documents public is to hide the personal data of the parties. In this article we compare several Named Entity Recognition (NER) models for extracting personal information, such as person names, organization names and location names, from judicial acts in Russian and Kazakh.
Methodology: Four types of algorithms were chosen for training NER models: CRF (Conditional Random Fields), LSTM (Long Short-Term Memory) with character embeddings, LSTM-CRF and BERT (Bidirectional Encoder Representations from Transformers).
Findings: Models trained with all four algorithms achieve reasonably high accuracy because the source documents in the judicial dataset have a nearly uniform structure. BERT shows the best performance of the four algorithms (F1 score: 0.87).
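To make the objective and the reported metric concrete, here is a minimal Python sketch of how NER output could be used to hide personal data and how an entity-level F1 score could be computed. It is not the authors' implementation. The BIO tagging scheme, the label names PER, ORG and LOC, the function names, and the sample sentence are all assumptions for illustration. The abstract also does not state whether the reported F1 is entity-level or token-level, so exact-match entity-level scoring is assumed here.

```python
from typing import List, Optional, Set, Tuple

# (entity label, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        # A stray "I-" without a matching "B-" is leniently treated as a start.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def redact(tokens: List[str], tags: List[str]) -> str:
    """Replace each tagged entity with a placeholder such as [PER]."""
    out: List[Optional[str]] = list(tokens)
    for label, start, end in bio_to_spans(tags):
        # Same-length slice assignment keeps the remaining indices valid.
        out[start:end] = [f"[{label}]"] + [None] * (end - start - 1)
    return " ".join(t for t in out if t is not None)


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a prediction counts only if label and boundaries match."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence and tags, not taken from the judicial dataset.
tokens = ["Ivanov", "Ivan", "filed", "a", "claim", "against",
          "Alfa", "LLP", "in", "Almaty"]
gold = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O", "O", "B-LOC"]

print(redact(tokens, gold))
print(round(entity_f1(gold, pred), 3))
```

In this sample the predicted organization span stops one token early, so only two of the three entities match exactly, and both precision and recall are 2/3. Under exact-match scoring, a boundary error therefore costs as much as a missed entity, which matters for redaction because a partially masked name still leaks personal data.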