Named Entity Recognition for the Azerbaijani Language

2021 IEEE 15th International Conference on Application of Information and Communication Technologies (AICT). Pub Date: 2021-10-13. DOI: 10.1109/AICT52784.2021.9620336

Natavan Akhundova


Citations: 0

Abstract



This paper develops a Named Entity Recognition (NER) system for Azerbaijani, a low-resource language. It builds NER models using two approaches, rule-based and machine learning-based, and compares their performance on familiar and unfamiliar datasets to determine which works best. The rule-based approach relies mainly on statistics and achieves adequate results: an F-score of 70% on both datasets. The machine learning approach comprises three models. The first is trained from scratch with a convolutional neural network (CNN) using the spaCy library and gives the best outcome, with an F-score above 90% on each test dataset. The second is a pre-trained multilingual spaCy model, used for contrast; it scored below 50% in testing, which demonstrates the importance of the domain in which an NER model is trained. The third is a new model trained on top of the multilingual model using the training dataset; it performs best in its domain.
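The abstract compares models by F-score but does not say how it was computed. As a minimal sketch, assuming exact-match, entity-level scoring (a common convention for NER, not confirmed by the paper), the metric can be derived from gold and predicted entity spans as follows. The function name `entity_f_score`, the `Span` type, and the sample spans are hypothetical and are not taken from the paper's data.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span representation: (start_char, end_char, label).
Span = Tuple[int, int, str]


def entity_f_score(
    gold: Iterable[Set[Span]], predicted: Iterable[Set[Span]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over per-sentence span sets.

    Assumption: a prediction counts as correct only if its boundaries
    and label both match a gold span exactly.
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        tp += len(gold_spans & pred_spans)  # exact matches
        fp += len(pred_spans - gold_spans)  # predicted but not in gold
        fn += len(gold_spans - pred_spans)  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative data only: two sentences, one mislabeled entity.
gold_example = [{(0, 7, "PER"), (20, 24, "LOC")}, {(5, 12, "ORG")}]
pred_example = [{(0, 7, "PER"), (20, 24, "ORG")}, {(5, 12, "ORG")}]
precision, recall, f_score = entity_f_score(gold_example, pred_example)
```

In this toy case two of three predictions match exactly, so precision, recall, and F-score all equal 2/3. A mislabeled span counts as both a false positive and a false negative under this convention, which is one reason a model applied outside its training domain can score low.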
