汉语长距离语境依赖的建模

Workshop on Chinese Language Processing Pub Date : 2003-07-11 DOI:10.3115/1119250.1119260

Guodong Zhou

{"title":"汉语长距离语境依赖的建模","authors":"Guodong Zhou","doi":"10.3115/1119250.1119260","DOIUrl":null,"url":null,"abstract":"Ngram modeling is simple in language modeling and has been widely used in many applications. However, it can only capture the short distance context dependency within an N-word window where the largest practical N for natural language is three. In the meantime, much of context dependency in natural language occurs beyond a three-word window. In order to incorporate this kind of long distance context dependency, this paper proposes a new MI-Ngram modeling approach. The MI-Ngram model consists of two components: an ngram model and an MI model. The ngram model captures the short distance context dependency within an N-word window while the MI model captures the long distance context dependency between the word pairs beyond the N-word window by using the concept of mutual information. It is found that MI-Ngram modeling has much better performance than ngram modeling. Evaluation on the XINHUA new corpus of 29 million words shows that inclusion of the best 1,600,000 word pairs decreases the perplexity of the MI-Trigram model by 20 percent compared with the trigram model. In the meanwhile, evaluation on Chinese word segmentation shows that about 35 percent of errors can be corrected by using the MI-Trigram model compared with the trigram model.","PeriodicalId":403123,"journal":{"name":"Workshop on Chinese Language Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2003-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Modeling of Long Distance Context Dependency in Chinese\",\"authors\":\"Guodong Zhou\",\"doi\":\"10.3115/1119250.1119260\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Ngram modeling is simple in language modeling and has been widely used in many applications. However, it can only capture the short distance context dependency within an N-word window where the largest practical N for natural language is three. In the meantime, much of context dependency in natural language occurs beyond a three-word window. In order to incorporate this kind of long distance context dependency, this paper proposes a new MI-Ngram modeling approach. The MI-Ngram model consists of two components: an ngram model and an MI model. The ngram model captures the short distance context dependency within an N-word window while the MI model captures the long distance context dependency between the word pairs beyond the N-word window by using the concept of mutual information. It is found that MI-Ngram modeling has much better performance than ngram modeling. Evaluation on the XINHUA new corpus of 29 million words shows that inclusion of the best 1,600,000 word pairs decreases the perplexity of the MI-Trigram model by 20 percent compared with the trigram model. In the meanwhile, evaluation on Chinese word segmentation shows that about 35 percent of errors can be corrected by using the MI-Trigram model compared with the trigram model.\",\"PeriodicalId\":403123,\"journal\":{\"name\":\"Workshop on Chinese Language Processing\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2003-07-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on Chinese Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3115/1119250.1119260\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Chinese Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3115/1119250.1119260","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

Ngram建模是一种简单的语言建模方法，在许多应用中得到了广泛的应用。然而，它只能在N个单词的窗口内捕获短距离上下文依赖，其中自然语言的最大实际N是3。与此同时，自然语言中的许多上下文依赖关系发生在三个单词的窗口之外。为了整合这种长距离上下文依赖，本文提出了一种新的MI-Ngram建模方法。MI- ngram模型由两个部分组成:ngram模型和MI模型。ngram模型捕获n个单词窗口内的短距离上下文依赖关系，而MI模型通过使用互信息的概念捕获n个单词窗口外单词对之间的长距离上下文依赖关系。研究发现，MI-Ngram建模比ngram建模具有更好的性能。对新华社新语料库2900万词的评估表明，与三格表模型相比，包含最好的160万个词对的MI-Trigram模型的困惑度降低了20%。与此同时，对汉语分词的评价表明，使用MI-Trigram模型与使用trigram模型相比，可以纠正约35%的错误。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Modeling of Long Distance Context Dependency in Chinese

Ngram modeling is simple in language modeling and has been widely used in many applications. However, it can only capture the short distance context dependency within an N-word window where the largest practical N for natural language is three. In the meantime, much of context dependency in natural language occurs beyond a three-word window. In order to incorporate this kind of long distance context dependency, this paper proposes a new MI-Ngram modeling approach. The MI-Ngram model consists of two components: an ngram model and an MI model. The ngram model captures the short distance context dependency within an N-word window while the MI model captures the long distance context dependency between the word pairs beyond the N-word window by using the concept of mutual information. It is found that MI-Ngram modeling has much better performance than ngram modeling. Evaluation on the XINHUA new corpus of 29 million words shows that inclusion of the best 1,600,000 word pairs decreases the perplexity of the MI-Trigram model by 20 percent compared with the trigram model. In the meanwhile, evaluation on Chinese word segmentation shows that about 35 percent of errors can be corrected by using the MI-Trigram model compared with the trigram model.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Workshop on Chinese Language Processing

自引率

0.00%

发文量