An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing

Int. J. Comput. Linguistics Chin. Lang. Process. Pub Date: 1996-08-01 DOI: 10.30019/IJCLCLP.199608.0004

Keh-Yih Su, Tung-Hui Chiang, Jing-Shin Chang

Citations: 20

Abstract

This paper introduces a Corpus-Based Statistics-Oriented (CBSO) methodology, which attempts to avoid the drawbacks of both traditional rule-based approaches and purely statistical approaches. Rule-based approaches, whose rules are induced by human experts, have long been the dominant paradigm in the natural language processing community. Such approaches, however, suffer from serious difficulties in knowledge acquisition, in terms of both cost and consistency, and are therefore very difficult to scale up. Statistical methods, which can acquire knowledge automatically from corpora, are becoming increasingly popular, in part because they remedy the shortcomings of rule-based approaches. However, most simple statistical models adopt almost nothing from existing linguistic knowledge; they often have a large parameter space and thus require an unaffordably large training corpus, even for well-justified linguistic phenomena. The CBSO approach is a compromise between these two extremes of the knowledge-acquisition spectrum. It emphasizes using well-justified linguistic knowledge to develop the underlying language model and applying statistical optimization techniques to high-level constructs, such as annotated syntax trees, rather than to surface strings, so that a training corpus of reasonable size suffices and long-distance dependencies between constituents can be handled. This paper reviews corpus-based statistics-oriented techniques and introduces general techniques applicable to CBSO approaches. In particular, it addresses the following issues: (1) the general tasks in developing an NLP system; (2) why CBSO is the preferred choice among different strategies; (3) how to achieve good performance systematically using a CBSO approach; and (4) frequently used CBSO techniques. Several examples are also reviewed.
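
The abstract's point about parameter space can be made concrete with a rough count. The sketch below is an illustration only, not the authors' model: it assumes an unsmoothed n-gram model, and the vocabulary size (20,000 words) and tag-set size (50 categories) are hypothetical figures chosen for the example. It compares a model over surface words with one over a small set of linguistically motivated categories.

```python
def ngram_parameter_count(num_symbols: int, order: int) -> int:
    """Count the free parameters of an unsmoothed n-gram model.

    Each of the num_symbols ** (order - 1) possible histories carries a
    probability distribution over num_symbols outcomes, and each such
    distribution has num_symbols - 1 free parameters (they sum to 1).
    """
    if num_symbols < 1 or order < 1:
        raise ValueError("num_symbols and order must both be at least 1")
    return num_symbols ** (order - 1) * (num_symbols - 1)


# Hypothetical sizes, assumed for illustration only.
VOCAB_SIZE = 20_000   # distinct surface words
TAGSET_SIZE = 50      # distinct linguistic categories (e.g. part-of-speech tags)

word_trigram = ngram_parameter_count(VOCAB_SIZE, 3)
tag_trigram = ngram_parameter_count(TAGSET_SIZE, 3)

print(f"word trigram parameters: {word_trigram:,}")
print(f"tag trigram parameters:  {tag_trigram:,}")
print(f"reduction factor:        {word_trigram / tag_trigram:,.0f}x")
```

Under these assumed sizes, the word-level trigram model has several orders of magnitude more parameters than the category-level one, which is the kind of gap that makes a purely surface-string model demand a far larger training corpus.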
