NLTK tagger for Albanian using iterative approach

Proceedings of the ITI 2013 35th International Conference on Information Technology Interfaces. Pub Date: 2013-06-24. DOI: 10.2498/iti.2013.0565

A. Kadriu


Citations: 13

Abstract



This paper presents research on a part-of-speech (POS) tagging model for Albanian texts, built with the NLTK toolkit. The model cascades three taggers with backoff. We use a dictionary of around 32,000 words with their corresponding POS tags, together with a set of regular-expression rules. A lemmatization module is implemented to convert nouns and verbs to their lemmas. The text is first tagged with a unigram tagger based on the dictionary. This tagger serves as the baseline tagger for a regular-expression tagger. Incorrectly lemmatized words are then corrected, which produces a third, lookup tagger. This third tagger is used with the first and second taggers as backoff.
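The cascade described in the abstract can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the word lists, tag names, regular-expression rules, and the order in which the taggers are consulted are all assumptions made for illustration. In NLTK itself, the comparable building blocks would be `UnigramTagger(model=...)` and `RegexpTagger(...)`, chained through their `backoff` argument.

```python
import re
from typing import Callable, Optional

# A tagger maps a word to a tag, or to None when it has no opinion,
# so that the next tagger in the chain can be consulted (backoff).
Tagger = Callable[[str], Optional[str]]


def lookup_tagger(table: dict) -> Tagger:
    """Tag a word only if it appears in the table (case-insensitive)."""
    return lambda word: table.get(word.lower())


def regexp_tagger(rules: list) -> Tagger:
    """Tag a word with the tag of the first matching (pattern, tag) rule."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in rules]

    def tag_word(word: str) -> Optional[str]:
        for pattern, tag in compiled:
            if pattern.search(word):
                return tag
        return None

    return tag_word


def tag_with_backoff(words: list, taggers: list) -> list:
    """Try each tagger in order; keep the first non-None answer per word."""
    tagged = []
    for word in words:
        tag = None
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                break
        tagged.append((word, tag))
    return tagged


# Hypothetical data: a tiny stand-in for the ~32,000-word dictionary,
# a corrections table for a word the dictionary lookup would miss,
# and two illustrative rules (digits, and the adverbial ending "-isht").
dictionary = {"dhe": "CONJ", "libër": "NOUN", "lexoj": "VERB"}
corrections = {"librin": "NOUN"}
rules = [(r"^\d+$", "NUM"), (r"isht$", "ADV")]

# Assumed order: corrections first, then the dictionary, then the rules.
cascade = [
    lookup_tagger(corrections),
    lookup_tagger(dictionary),
    regexp_tagger(rules),
]

words = ["lexoj", "librin", "dhe", "2013", "zakonisht", "xyz"]
tagged = tag_with_backoff(words, cascade)
print(tagged)
```

Words that no tagger recognizes keep the tag `None` here, which makes the remaining gaps easy to spot; a default tag could be appended as a final fallback if every token must receive a tag.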
