A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models

Int. J. Comput. Linguistics Chin. Lang. Process. Pub Date: 2006-12-01 DOI: 10.30019/IJCLCLP.200612.0005

Wei Jiang, Yi Guan, Xiaolong Wang

Citations: 10

Abstract

This paper presents a pragmatic Chinese word segmentation approach based on mixing language models. Chinese word segmentation comprises several hard sub-tasks, each of which poses different difficulties. The authors apply a suitable language model to each sub-task so as to exploit the strengths of each model. First, a class-based trigram model is adopted for basic word segmentation, with Absolute Discount Smoothing used to overcome data sparseness, and a Maximum Entropy (ME) model is used to identify Named Entities. Second, the authors propose applying rough sets, average mutual information, and related methods to extract special features. Finally, some features are extended by combining word clusters with a thesaurus. The authors' system participated in the Second International Chinese Word Segmentation Bakeoff and achieved F-measures of 96.7 and 97.2 in the PKU and MSRA open tests, respectively.
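As a rough illustration of the smoothing step named in the abstract, the sketch below applies interpolated absolute discounting to a word bigram model. It is not the authors' implementation: the paper uses a class-based trigram, whereas this example uses a plain word bigram for brevity, and the function name `train_bigram_absolute_discount`, the discount value 0.75, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram_absolute_discount(sentences, discount=0.75):
    """Return prob(w, h) = P(w | h) with interpolated absolute discounting.

    Simplified assumption: a word bigram model stands in for the
    class-based trigram described in the abstract.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    total = sum(unigrams.values())
    history_count = Counter()  # c(h): how often h occurs as a history
    followers = Counter()      # number of distinct words seen after h
    for (h, _w), c in bigrams.items():
        history_count[h] += c
        followers[h] += 1

    def prob(w, h):
        p_uni = unigrams[w] / total
        if history_count[h] == 0:
            # Unseen history: fall back to the unigram estimate.
            return p_uni
        # Subtract a fixed discount from every seen bigram count ...
        discounted = max(bigrams[(h, w)] - discount, 0.0) / history_count[h]
        # ... and redistribute the freed mass via the unigram model.
        backoff_mass = discount * followers[h] / history_count[h]
        return discounted + backoff_mass * p_uni

    return prob


# Hypothetical pre-segmented toy corpus.
corpus = [["我", "爱", "北京"], ["我", "爱", "中国"]]
prob = train_bigram_absolute_discount(corpus)
vocab = {w for sent in corpus for w in sent}
```

With this toy corpus, a bigram never observed in training, such as `prob("北京", "我")`, still receives a small nonzero probability, and the probabilities over the vocabulary for a seen history sum to one. That is the effect smoothing is meant to have on sparse data.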



