Word and Sentence Tokenization with Hidden Markov Models

J. Lang. Technol. Comput. Linguistics. Pub Date: 2013-07-01. DOI: 10.21248/jlcl.28.2013.176

Bryan Jurish, Kay-Michael Würzner

Citations: 53

Abstract

We present a novel method (“waste”) for the segmentation of text into tokens and sentences. Our approach makes use of a Hidden Markov Model for the detection of segment boundaries. Model parameters can be estimated from pre-segmented text, which is widely available in the form of treebanks or aligned multi-lingual corpora. We formally define the waste boundary detection model and evaluate the system’s performance on corpora from various languages as well as on a small corpus of computer-mediated communication.

