A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)

Nongnuch Ketui, T. Theeramunkong, C. Onsuwan
Published in: 2012 Seventh International Conference on Knowledge, Information and Creativity Support Systems
Publication date: 2012-11-08
DOI: 10.1109/KICSS.2012.33 (https://doi.org/10.1109/KICSS.2012.33)
Citations: 13

Abstract

Discovering discourse units in Thai, a language written without explicit word or sentence boundaries, is not straightforward because of its high part-of-speech (POS) ambiguity and serial verb constructions. This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and a longest-matching-based chart parser. The T-EDU definitions are used to construct a set of context-free grammar (CFG) rules: 446 CFG rules are built from 1,340 T-EDUs extracted from Thai-NEST, a corpus tagged with named entities (NE) and POS. The T-EDUs were evaluated by two linguists, with a kappa score of 0.68. Separately, a two-level evaluation is applied: one level uses an arranged setting in which the text is pre-chunked, and the other uses a normal setting in which the original running text is tested. By specifying one grammar rule per T-EDU instance, perfect recall (100%) is achievable in a closed test, where the testing corpus and the training corpus are the same, but a recall of only about 36.16% and 31.69% is obtained for the chunked and the running texts, respectively. In an open test with 3-fold cross-validation, the recall is around 67% while the precision is only 25-28%. To improve precision, two alternative strategies are applied: left-to-right longest matching (L2R-LM) and maximal longest matching (M-LM). For the running text in the closed test, L2R-LM and M-LM raise the precision to 93.97% and 94.03%, respectively, while the recall drops slightly to 94.18% and 92.91%. For the running text in the open test, the F-score improves to 57.70% for L2R-LM and 54.14% for M-LM.
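The left-to-right longest-matching idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' chart parser: it assumes each rule is a flat sequence of tags rather than a full CFG production, and the tag names, the rule set, and the function `l2r_longest_match` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Set, Tuple

# Assumption: a rule is a flat tag sequence that forms one unit.
# The paper uses CFG rules and a chart parser; this is a simplification.
Rule = Tuple[str, ...]


def l2r_longest_match(tags: Sequence[str], rules: Set[Rule]) -> List[Tuple[int, int]]:
    """Scan left to right and return (start, end) spans, end exclusive."""
    max_len = max((len(r) for r in rules), default=0)
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tags):
        matched = 0
        # Try the longest candidate first so the longest rule wins.
        for length in range(min(max_len, len(tags) - i), 0, -1):
            if tuple(tags[i:i + length]) in rules:
                matched = length
                break
        if matched:
            spans.append((i, i + matched))
            i += matched  # continue right after the accepted unit
        else:
            i += 1  # no rule starts at this position; skip one tag
    return spans


# Hypothetical tags and rules, for illustration only.
RULES: Set[Rule] = {("NP", "VP"), ("NP", "VP", "NP"), ("CONJ", "VP")}
TAGS = ["NP", "VP", "NP", "CONJ", "VP", "ADV"]
SPANS = l2r_longest_match(TAGS, RULES)
print(SPANS)
```

At position 0 both `("NP", "VP")` and `("NP", "VP", "NP")` apply, and the sketch keeps only the longer one. Discarding shorter overlapping candidates in this way is one plausible reading of why longest matching raises precision at a small cost in recall.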