A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)

2012 Seventh International Conference on Knowledge, Information and Creativity Support Systems. Pub Date: 2012-11-08. DOI: 10.1109/KICSS.2012.33

Nongnuch Ketui, T. Theeramunkong, C. Onsuwan


Citations: 13

Abstract


Discovering discourse units in Thai, a language without word and sentence boundaries, is not straightforward because of its high part-of-speech (POS) ambiguity and serial verb constituents. This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and a longest-matching-based chart parser. The T-EDU definitions are used to construct a set of context-free grammar (CFG) rules: 446 CFG rules are built from 1,340 T-EDUs extracted from the NE- and POS-tagged corpus Thai-NEST. These T-EDUs are evaluated by two linguists, with a kappa score of 0.68. Separately, a two-level evaluation is applied: one level uses an arranged setting in which the text is pre-chunked, and the other uses a normal setting in which the original running text is tested. By specifying one grammar rule per T-EDU instance, perfect recall (100%) can be reached in a closed test, where the testing and training corpora are the same, but recall of approximately 36.16% and 31.69% is obtained for the chunked and running texts, respectively. In an open test with 3-fold cross-validation, recall is around 67% while precision is only 25-28%. To improve precision, two alternative strategies are applied: left-to-right longest matching (L2R-LM) and maximal longest matching (M-LM). The results show that L2R-LM and M-LM improve precision to 93.97% and 94.03%, respectively, for the running text in the closed test, while recall drops slightly to 94.18% and 92.91%. For the running text in the open test, the F-score improves to 57.70% and 54.14% for L2R-LM and M-LM, respectively.
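The left-to-right longest matching idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's chart parser: it assumes that each rule is simplified to a flat POS-tag sequence, and the tag names, the `RULES` set, and the function `l2r_longest_match` are all hypothetical names chosen for illustration. The paper's 446 CFG rules are not reproduced here.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical toy rules: each tuple is a POS-tag sequence treated as one unit.
# These are illustrative assumptions, not rules taken from the paper.
RULES: Set[Tuple[str, ...]] = {
    ("NOUN", "VERB"),
    ("NOUN", "VERB", "NOUN"),
    ("CONJ", "NOUN", "VERB"),
    ("VERB", "NOUN"),
}


def l2r_longest_match(
    tags: Sequence[str], rules: Set[Tuple[str, ...]]
) -> List[Tuple[int, int]]:
    """Scan left to right; at each position take the longest rule that matches.

    Returns half-open (start, end) index spans over `tags`.
    Tags that start no matching rule are skipped one at a time.
    """
    max_len = max((len(rule) for rule in rules), default=0)
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tags):
        matched = 0
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(tags) - i), 0, -1):
            if tuple(tags[i:i + length]) in rules:
                matched = length
                break
        if matched:
            spans.append((i, i + matched))
            i += matched  # continue after the unit just emitted
        else:
            i += 1  # no rule starts here; move on by one tag
    return spans


if __name__ == "__main__":
    sample = ["NOUN", "VERB", "NOUN", "CONJ", "NOUN", "VERB"]
    print(l2r_longest_match(sample, RULES))
```

Preferring the longest match at each position yields one span per start point instead of every span that some rule licenses, which is one plausible reading of why such a strategy raises precision at a small cost in recall.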
