Research on recognition of semantic chunk boundary in Tibetan

2014 International Conference on Asian Language Processing (IALP), Pub Date: 2014-12-04, DOI: 10.1109/IALP.2014.6973476

Tianhang Wang, Shumin Shi, Heyan Huang, Congjun Long, Ruijing Li


Citations: 1

Abstract

Semantic chunks describe the semantic framework of a sentence well and play an important role in natural language processing applications such as machine translation and question answering. Current research on Tibetan chunking is mainly rule-based. In this paper, drawing on the distinctive linguistic characteristics of Tibetan, we first put forward a descriptive definition of the Tibetan semantic chunk and a labeling scheme for it, and then propose a feature selection algorithm that automatically selects suitable feature templates from a set of candidates. In experiments on two kinds of Tibetan corpora, a sentence-level corpus (corpus-sentence) and a discourse-level corpus (corpus-discourse), the F-measure reaches 95.84%, 94.95% and 91.97%, 88.82% using the Conditional Random Fields (CRF) model and the Maximum Entropy (ME) model, respectively. These positive results show that the proposed definition of the Tibetan semantic chunk is reasonable and operable, and that its boundaries can be recognized feasibly and effectively with statistical techniques on a small-scale corpus.
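The abstract reports results as F-measure over recognized chunk boundaries. As a minimal sketch of how such a score can be computed, the following assumes a generic B/I/O tagging (B begins a chunk, I continues it, O is outside any chunk) and exact-match scoring of chunk spans. The tag set and the function names `chunk_spans` and `f_measure` are illustrative assumptions, not the labeling scheme or evaluation code of the paper.

```python
from typing import List, Set, Tuple


def chunk_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Convert a B/I/O tag sequence into (start, end) spans, end exclusive.

    Assumed scheme (hypothetical, not the paper's): "B" opens a chunk,
    "I" continues the open chunk, "O" closes any open chunk.
    """
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag in ("B", "O"):
            # A new chunk or an outside token closes the chunk in progress.
            if start is not None:
                spans.add((start, i))
                start = None
            if tag == "B":
                start = i
        # "I" extends the open chunk; a stray "I" with no open chunk is ignored.
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def f_measure(gold: List[str], pred: List[str]) -> float:
    """Balanced F-measure: a chunk counts as correct only if both of its
    boundaries match the gold annotation exactly."""
    gold_spans = chunk_spans(gold)
    pred_spans = chunk_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["B", "I", "B", "I", "I", "O", "B"]
    pred = ["B", "I", "B", "I", "B", "O", "B"]
    print(f"F-measure: {f_measure(gold, pred):.4f}")
```

In this toy example the prediction splits one gold chunk in two, so two of its four predicted chunks match the three gold chunks, giving precision 1/2 and recall 2/3.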


