NLTK tagger for Albanian using iterative approach

Proceedings of the ITI 2013 35th International Conference on Information Technology Interfaces. Pub Date: 2013-06-24. DOI: 10.2498/iti.2013.0565

A. Kadriu

Citations: 13

Abstract

This paper presents research on a part-of-speech (POS) tagging model for Albanian texts, built with the NLTK toolkit. The model cascades three taggers with backoff. We use a dictionary of around 32,000 words together with their corresponding POS tags, as well as a set of regular-expression rules. A lemmatizer module is implemented to convert nouns and verbs to their lemmas. The text is first tagged with a unigram tagger based on the dictionary. This serves as the baseline tagger for a regular-expression tagger. Incorrectly lemmatized words are then corrected, which yields a third, lookup tagger. This tagger is used with the first and second taggers as backoff.
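The cascade described in the abstract can be sketched with NLTK's standard tagger classes. This is a minimal illustration under stated assumptions, not the paper's implementation: the word lists, tag names, regular expressions, and the function name `tag_tokens` are hypothetical, the lemmatizer step is omitted, and the exact backoff order (corrections lookup, then dictionary, then regular expressions) is one plausible reading of the abstract.

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Hypothetical toy data. The paper's dictionary holds about 32,000 entries;
# these words, tags, and patterns are placeholders for illustration only.
dictionary = {"libri": "N", "lexoj": "V", "dhe": "CONJ"}
patterns = [
    (r".*isht$", "ADV"),  # assumed suffix rule, not taken from the paper
    (r"^\d+$", "NUM"),    # tokens made only of digits
]
corrections = {"punoj": "V"}  # stands in for the corrected-word lookup table

# Last resort: regular-expression rules (returns None when nothing matches).
regexp_tagger = RegexpTagger(patterns)

# Dictionary-based unigram tagger; unknown words fall through to the rules.
dictionary_tagger = UnigramTagger(model=dictionary, backoff=regexp_tagger)

# Lookup tagger built from corrections, consulted first.
lookup_tagger = UnigramTagger(model=corrections, backoff=dictionary_tagger)


def tag_tokens(tokens):
    """Tag a list of tokens with the three-stage backoff chain."""
    return lookup_tagger.tag(tokens)


tagged = tag_tokens(["libri", "dhe", "natyrisht", "punoj", "2013"])
print(tagged)
```

Each tagger answers only for tokens it knows and hands the rest to its `backoff`, so a token that no stage recognizes is returned with the tag `None`.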


