Isarn Dharma word segmentation

2013 International Conference on Control, Automation and Information Sciences (ICCAIS). Pub Date: 2013-11-01. DOI: 10.1109/ICCAIS.2013.6720529

Sittichai Somsap, Pusadee Seresangtakul

Citations: 4

Abstract

This paper presents an approach to Isarn Dharma word segmentation based on the Isarn Dharma writing system and a dictionary. Input text is first segmented into a sequence of Isarn Dharma Character Clusters (IDCCs), where each IDCC is a group of characters that the Isarn Dharma writing system treats as inseparable. The IDCC sequence is then matched against the dictionary with the IDCC longest matching algorithm to find the most suitable word segmentation. Grouping rules are then applied to group adjacent remaining IDCCs that do not match any Isarn word in the dictionary. To evaluate the proposed technique, texts from Isarn literature, Jataka, legend, and Buddha foretell were used as test data, and the results were compared with those of longest matching and IDCC longest matching. The experimental results show F-measures of 80.15%, 85.06%, and 86.07% for longest matching, the IDCC longest matching algorithm, and the proposed method, respectively.
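The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the text has already been split into clusters, uses Latin placeholder strings in place of real IDCCs, and replaces the paper's grouping rules (which the abstract does not spell out) with a simple rule that merges every run of adjacent unmatched clusters into one unknown word. All function names and the toy dictionary are hypothetical.

```python
from typing import List, Set, Tuple

Cluster = str  # placeholder for one inseparable character cluster (IDCC)
Word = Tuple[Cluster, ...]


def longest_match(clusters: List[Cluster], dictionary: Set[Word],
                  max_word_len: int) -> List[Tuple[Word, bool]]:
    """Greedy left-to-right longest matching over cluster sequences.

    Returns (word, matched) pairs. matched is False for a single cluster
    at which no dictionary word starts.
    """
    result: List[Tuple[Word, bool]] = []
    i = 0
    while i < len(clusters):
        # Try the longest candidate first, shrinking one cluster at a time.
        for length in range(min(max_word_len, len(clusters) - i), 0, -1):
            candidate = tuple(clusters[i:i + length])
            if candidate in dictionary:
                result.append((candidate, True))
                i += length
                break
        else:
            # No dictionary word starts here: keep the cluster as unmatched.
            result.append(((clusters[i],), False))
            i += 1
    return result


def group_unmatched(segments: List[Tuple[Word, bool]]) -> List[str]:
    """Assumed stand-in for the grouping rules: merge each run of
    adjacent unmatched clusters into a single unknown word."""
    words: List[str] = []
    pending: List[Cluster] = []
    for word, matched in segments:
        if matched:
            if pending:
                words.append("".join(pending))
                pending = []
            words.append("".join(word))
        else:
            pending.extend(word)
    if pending:
        words.append("".join(pending))
    return words


def segment(clusters: List[Cluster], dictionary: Set[Word]) -> List[str]:
    """Longest matching over clusters, then grouping of the leftovers."""
    max_len = max((len(w) for w in dictionary), default=1)
    return group_unmatched(longest_match(clusters, dictionary, max_len))


# Toy data: each dictionary entry is a tuple of clusters.
toy_dictionary = {("ka", "la"), ("ka",), ("ma",)}
toy_clusters = ["ka", "la", "x", "y", "ma"]
print(segment(toy_clusters, toy_dictionary))  # ['kala', 'xy', 'ma']
```

Matching on whole clusters rather than on individual characters means a candidate word boundary can never fall inside a cluster, which is the property the abstract attributes to IDCCs.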
