Named-entity recognition in Turkish legal texts

Impact factor 1.9; Zone 3 (Computer Science); JCR Q3, Computer Science, Artificial Intelligence

Natural Language Engineering Pub Date : 2022-07-11 DOI:10.1017/S1351324922000304

Can Çetindağ, Berkay Yazıcıoğlu, Aykut Koç

Natural Language Engineering, volume 29, pages 615-642. Publication type: journal article. Not open access.

Citations: 8

Abstract

Natural language processing (NLP) technologies and applications in legal text processing are gaining momentum. As one of the most prominent tasks in NLP, named-entity recognition (NER) can be of great use to NLP in law because of the variety of named entities in the legal domain and their heightened importance in legal documents. However, domain-specific NER models for the legal domain are not well studied. We present an NER model for Turkish legal texts with a custom-made corpus, as well as several NER architectures based on conditional random fields (CRFs) and bidirectional long short-term memory networks (BiLSTMs) to address the task. We also study several combinations of word embeddings, namely GloVe and Morph2Vec, together with neural network-based character feature extraction using either BiLSTMs or convolutional neural networks. We report a 92.27% F1 score with a hybrid word representation of GloVe and Morph2Vec combined with character-level features extracted with a BiLSTM. Because Turkish is an agglutinative language, its morphological structure is also taken into account. To the best of our knowledge, our work is the first legal domain-specific NER study in Turkish and also the first study for an agglutinative language in the legal domain. Thus, our work can also have implications beyond the Turkish language.
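The abstract reports an F1 score but does not say how it is computed. The sketch below assumes the common entity-level convention for NER, in which a predicted entity counts as correct only if both its type and its exact span match the gold annotation. The BIO tagging scheme, the tag names (PER, COURT, LAW), and the helper names are illustrative assumptions, not taken from the paper.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))  # close the span that just ended
        if tag.startswith(("B-", "I-")):
            # A stray "I-" without a matching "B-" is treated as a span start.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged, exact-match entity-level F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)       # same type and same boundaries
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags: the prediction truncates the COURT entity by one token.
gold = [["B-PER", "I-PER", "O", "B-COURT", "I-COURT", "O", "B-LAW"]]
pred = [["B-PER", "I-PER", "O", "B-COURT", "O", "O", "B-LAW"]]
print(f"{entity_f1(gold, pred):.4f}")  # 0.6667
```

Under this convention the truncated COURT span counts as both a false positive and a false negative, so two of three entities match and precision, recall, and F1 are all 2/3. A token-level F1 would score the same prediction more leniently, which is why the convention matters when comparing reported numbers.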


Source journal

Natural Language Engineering (Computer Science, Artificial Intelligence)

CiteScore: 5.90

Self-citation rate: 12.00%

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval, and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column, and book reviews.