Text corpus for natural language story-telling sentence generation: A design and evaluation

2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE). Pub Date: 2014-05-14. DOI: 10.1109/JCSSE.2014.6841846

W. Limpanadusadee, P. Punyabukkana, A. Suchato, Onintra Poobrasert

Citations: 1

Abstract

Automatic generation of narrative sentences from unordered word sets is desirable in Augmentative and Alternative Communication (AAC) systems for children with certain learning disabilities (LD). Regardless of how complex the Natural Language Processing in a sentence generation procedure is, the quality of the language model affects the generation results. This work compared the sentence generation accuracies of a multi-tier N-gram-based procedure trained on two corpora for the task of Thai simple sentence generation: BEST2010, a large publicly available text corpus, and a smaller but more specifically designed corpus. The latter, a new corpus called TELL-S, was created from an analysis of the textbooks used for Thai language subjects in grades 1 and 2 under the compulsory curriculum for Thai schools. The original procedure was also modified to incorporate additional constraints based on a story-telling guideline developed for children with LD. Evaluated on test sets of 195 sentences, each composed of 3-6 words with a specific Part-Of-Speech combination, TELL-S generalized better and yielded higher accuracies than BEST2010 in all cases with unbiased word sets. The sentence generation accuracies were 100% and 70% for 3-word and 4-word sentences, respectively. The average accuracy was 58.8% when longer sentences were also included.
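
To make the task concrete, the following is a minimal sketch of ordering an unordered word set with a plain bigram language model. It is an illustration only and not the authors' method: the paper's multi-tier procedure, its Part-Of-Speech handling, and its story-telling constraints are not reproduced here. The toy English corpus, the add-one smoothing, and the names `train_bigrams`, `log_prob`, and `generate` are all assumptions introduced for this example. Scoring every permutation is feasible because the word sets are small (a 6-word set has only 720 orderings).

```python
from collections import Counter
from itertools import permutations
import math

# Toy training corpus (hypothetical stand-in; the paper trains on Thai
# corpora, BEST2010 and TELL-S, which are not reproduced here).
CORPUS = [
    "the boy eats rice",
    "the girl eats rice",
    "the boy reads a book",
    "the girl reads a book",
]


def train_bigrams(sentences):
    """Count bigrams and their left-context words, with sentence boundaries."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>", *sentence.split(), "</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(order, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of one candidate word order."""
    tokens = ["<s>", *order, "</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def generate(words, model):
    """Return the permutation of `words` that the bigram model scores highest."""
    return max(permutations(words), key=lambda order: log_prob(order, *model))


MODEL = train_bigrams(CORPUS)
best = generate(["rice", "boy", "eats", "the"], MODEL)
print(" ".join(best))  # the boy eats rice
```

Under this sketch, the choice of training corpus changes the bigram counts and therefore which ordering wins, which is the kind of effect the comparison between BEST2010 and TELL-S measures.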
