Morphological Tagging and Lemmatization in the Albanian Language

SEEU Review. Pub Date: 2021-12-01. DOI: 10.2478/seeur-2021-0015

D. Mati, Mentor Hamiti, Elissa Mollakuqe





Morphological Tagging and Lemmatization in the Albanian Language

Abstract. Part-of-speech tagging is an important element of Natural Language Processing. Fine-grained word-class annotations enrich the word forms in a text and can be used in downstream processes such as dependency parsing. The improved search options that tagged data offers also greatly benefit linguists and lexicographers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. Some aspects of the Albanian language make the creation of a part-of-speech tag set challenging. This research discusses those linguistic phenomena and proposes a part-of-speech tag set that can adequately represent them. The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set. The syntagmatic aspects of the Albanian language are adequately represented. Additionally, this paper presents morphologically and part-of-speech tagged corpora for the Albanian language, as well as a lemmatizer and a neural morphological tagger trained on these corpora. On the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, with a morphological tagging rate of 85.31% and a lemmatization rate of 88.95%. Furthermore, the TF-IDF technique is used to weight terms, and the resulting scores highlight words that carry additional information in the Albanian corpus.
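The abstract does not say which TF-IDF variant the authors used, so the following is only a minimal sketch of the general idea, assuming the textbook form (relative term frequency multiplied by the natural log of inverse document frequency). The function name `tf_idf` and the toy token lists are hypothetical and are not taken from the paper or its corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed weighting (not confirmed by the paper):
    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy, made-up token lists used only for illustration.
docs = [
    ["gjuha", "shqipe", "ka", "histori"],
    ["gjuha", "ka", "rregulla"],
    ["libri", "ka", "fjalor"],
]
scores = tf_idf(docs)
# Terms sorted from most to least informative for the first document.
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
```

In this sketch a token that occurs in every document (here "ka") receives a score of zero, while tokens confined to a single document rank highest, which is the sense in which the scores highlight words that carry additional information.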

