A Unicode-based Adaptive Segmenter

Journal of Chinese Language and Computing, Pub Date: 2003-07-11, DOI: 10.3115/1119250.1119275

Q. Lu, Shiu-tong Chan, Baoli Li, Shiwen Yu

Citations: 13

Abstract

This paper presents a Unicode-based Chinese word segmenter that can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses a divide-and-conquer strategy: personal names, numbers, times, numerical values, and similar items are recognized in the preprocessing stage. The segmenter then uses tagging information for disambiguation. The design is modular: each functional part is implemented as a separate module that tackles one problem at a time, which provides flexibility and extensibility. Results show that adding preprocessing modules and auxiliary modules increases the accuracy of the segmenter and makes the system easy to adapt to different applications.
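The divide-and-conquer idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the regular expressions, the toy lexicon, the tag names, and the use of forward maximum matching for the remaining text are all assumptions made for illustration, and the tagging-based disambiguation step is omitted. Pre-processing recognizers claim spans such as times and numbers first, and only the text between those spans is passed to the dictionary-based segmenter. Because Python strings are Unicode, Simplified and Traditional forms can sit in the same lexicon.

```python
import re
from typing import List, Pattern, Set, Tuple

Token = Tuple[str, str]  # (surface text, tag)

# Hypothetical pre-processing modules: each pairs a tag with a pattern.
# Order matters: when two patterns match at the same position, the earlier
# entry wins, so "time" is listed before "number".
RECOGNIZERS: List[Tuple[str, Pattern[str]]] = [
    ("time", re.compile(r"[0-9]{1,2}[:：][0-9]{2}")),
    ("number", re.compile(r"[0-9０-９]+(?:\.[0-9０-９]+)?")),
]

# Toy lexicon (an assumption) mixing Simplified and Traditional entries.
LEXICON: Set[str] = {"我们", "我們", "北京", "开会", "開會"}


def max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[Token]:
    """Forward maximum matching; unknown characters become single tokens."""
    tokens: List[Token] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append((piece, "word"))
                i += size
                break
    return tokens


def segment(text: str,
            recognizers: List[Tuple[str, Pattern[str]]] = RECOGNIZERS,
            lexicon: Set[str] = LEXICON) -> List[Token]:
    """Let recognizers claim spans first, then segment the gaps."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        # Find the recognizer whose next match starts earliest.
        best = None
        for tag, pattern in recognizers:
            m = pattern.search(text, pos)
            if m and (best is None or m.start() < best[1].start()):
                best = (tag, m)
        if best is None:
            # No more special spans: segment the rest with the lexicon.
            tokens.extend(max_match(text[pos:], lexicon))
            break
        tag, m = best
        tokens.extend(max_match(text[pos:m.start()], lexicon))
        tokens.append((m.group(), tag))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    print(segment("我们在北京10:30开会"))
```

In this sketch, adding a new pre-processing module amounts to appending another (tag, pattern) pair to `RECOGNIZERS`, which loosely mirrors the modular extensibility the abstract describes; a personal-name recognizer would likely need more than a regular expression.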
