Categorical Ambiguity and Information Content: A Corpus-based Study of Chinese

J. Chin. Lang. Comput., Pub Date: 2002-09-01, DOI: 10.3115/1118824.1118829

Chu-Ren Huang, Ru-Yng Chang

Citations: 5

Abstract

Assigning grammatical categories is a fundamental step in natural language processing, and ambiguity resolution is one of the most challenging NLP tasks, one that still lies beyond the power of machines. Where the two problems meet, resolving categorical ambiguity is a task that computational linguistic systems perform reasonably well, yet they still cannot match human performance. The task is even more challenging in Chinese language processing because morphological information that marks categories is scarce and there is no convention for marking word boundaries. In this paper, we investigate the nature of categorical ambiguity in Chinese based on the Sinica Corpus. The study differs crucially from previous studies in that it directly measures information content as the degree of ambiguity. This method not only offers an alternative interpretation of ambiguity but also allows a different measure of success for categorical disambiguation: instead of precision or recall, we can measure how much the information load has been reduced. The approach also allows us to identify the most ambiguous words in terms of information content. The somewhat surprising result reinforces the Saussurean view that, underlying the systemic linguistic structure, the assignment of linguistic content to each linguistic symbol is arbitrary.
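The idea of treating information content as the degree of ambiguity can be illustrated with a small sketch. The abstract does not give the paper's exact formula, so the code below assumes Shannon entropy over a word's category frequencies as a stand-in. The function names, the tag labels, and the toy counts are all hypothetical and are not drawn from the Sinica Corpus.

```python
import math
from collections import Counter


def category_entropy(tag_counts):
    """Shannon entropy (in bits) of a word's category distribution.

    Assumption: entropy stands in for the "information content"
    measure named in the abstract; the paper's own formula may differ.
    """
    total = sum(tag_counts.values())
    return sum(
        (count / total) * math.log2(total / count)
        for count in tag_counts.values()
        if count > 0
    )


def information_reduction(before, after):
    """Entropy removed by disambiguation (before minus after), in bits."""
    return category_entropy(before) - category_entropy(after)


# Hypothetical toy counts of category tags per word (not corpus data).
toy_counts = {
    "word_a": Counter({"N": 50, "V": 50}),  # evenly split: most ambiguous
    "word_b": Counter({"N": 99, "V": 1}),   # heavily skewed: nearly unambiguous
    "word_c": Counter({"N": 10}),           # single category: unambiguous
}

# Rank words from most to least ambiguous by entropy.
ranked = sorted(
    toy_counts,
    key=lambda word: category_entropy(toy_counts[word]),
    reverse=True,
)

for word in ranked:
    print(word, round(category_entropy(toy_counts[word]), 4))
```

Under this assumption, a word split evenly between two categories carries one bit of ambiguity, while a word with a heavily skewed distribution carries far less even though it has the same number of possible categories. That is the sense in which an entropy-style measure can rank words differently from a simple count of categories, and in which a disambiguator can be scored by the bits it removes as well as by precision or recall.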
