Isarn Dharma word segmentation

Sittichai Somsap, Pusadee Seresangtakul
Published in: 2013 International Conference on Control, Automation and Information Sciences (ICCAIS)
Publication date: 2013-11-01
DOI: 10.1109/ICCAIS.2013.6720529 (https://doi.org/10.1109/ICCAIS.2013.6720529)
Citations: 4

Abstract

This paper presents an Isarn Dharma word segmentation method based on the Isarn Dharma writing system and a dictionary. Input text is first segmented into a sequence of Isarn Dharma Character Clusters (IDCCs), where each IDCC is a group of characters that the Isarn Dharma writing system treats as inseparable. The IDCC sequence is then matched against the dictionary with the IDCC longest matching algorithm to find the most suitable word segmentation. Finally, grouping rules merge adjacent remaining IDCCs that do not match any Isarn word in the dictionary. To evaluate the proposed technique, Isarn literature, Jataka, legend, and Buddha foretell texts were used as test data, and the proposed system was compared with longest matching and with IDCC longest matching. The experimental results showed F-measures of 80.15%, 85.06%, and 86.07% for longest matching, IDCC longest matching, and the proposed method, respectively.
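The pipeline described in the abstract (longest matching over a cluster sequence, then grouping of unmatched clusters) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the input has already been split into clusters, uses Latin placeholder strings instead of real IDCCs, and uses a toy dictionary. The function name `longest_match_segment`, the `max_len` parameter, and the grouping rule (merge every run of adjacent unmatched clusters into one token) are all hypothetical, since the abstract does not specify the paper's actual grouping rules.

```python
from typing import List, Set, Tuple


def longest_match_segment(
    clusters: List[str],
    dictionary: Set[Tuple[str, ...]],
    max_len: int = 6,
) -> List[str]:
    """Greedy longest matching over a sequence of character clusters.

    `dictionary` holds words as tuples of clusters. Clusters that start no
    dictionary word are collected and merged into a single token per run;
    this merge is a placeholder for the paper's grouping rules.
    """
    words: List[str] = []
    pending: List[str] = []  # adjacent unmatched clusters awaiting grouping
    i = 0
    while i < len(clusters):
        match_len = 0
        # Try the longest candidate first, then shrink one cluster at a time.
        for n in range(min(max_len, len(clusters) - i), 0, -1):
            if tuple(clusters[i:i + n]) in dictionary:
                match_len = n
                break
        if match_len:
            if pending:
                # Close the current run of unmatched clusters.
                words.append("".join(pending))
                pending = []
            words.append("".join(clusters[i:i + match_len]))
            i += match_len
        else:
            pending.append(clusters[i])
            i += 1
    if pending:
        words.append("".join(pending))
    return words


# Toy data: placeholder clusters and a hypothetical dictionary.
toy_clusters = ["ka", "ra", "ma", "xx", "yy", "na"]
toy_dictionary = {("ka", "ra"), ("ka",), ("ma",), ("na",)}
print(longest_match_segment(toy_clusters, toy_dictionary))
```

Matching on whole clusters rather than single characters means a candidate word boundary can never fall inside an inseparable cluster, which is the property the abstract attributes to the IDCC step.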