Unsupervised models for morpheme segmentation and morphology learning

ACM Trans. Speech Lang. Process. Pub Date : 1900-01-01 DOI:10.1145/1187415.1187418

Mathias Creutz, K. Lagus

Citations: 407

Abstract

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
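The maximum a posteriori formulation mentioned in the abstract can be written out generically as below. This is a minimal sketch: the symbol M for the model (the induced morph lexicon together with whatever describes how morphs combine) and the name "corpus" for the raw text are notation assumed here, since the abstract gives no formula.

```latex
% Assumed notation: M = model (morph lexicon plus its usage information),
% corpus = the raw text data. Bayes' rule with the constant P(corpus) dropped.
\hat{M}
  = \arg\max_{M} P(M \mid \text{corpus})
  = \arg\max_{M} P(\text{corpus} \mid M)\, P(M)
```

Under this reading, the prior P(M) favors a compact lexicon, while the likelihood P(corpus | M) favors segmentations that account well for the observed words; the search for a segmentation trades the two off.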
