Web-based models for natural language processing

ACM Trans. Speech Lang. Process. Pub Date : 2005-02-01 DOI:10.1145/1075389.1075392

Mirella Lapata, Frank Keller

Citations: 201

Abstract

Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
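The abstract mentions backoff and interpolation as ways of combining Web counts with corpus counts. The following is a minimal sketch of what such combinations might look like, not the authors' implementation: the function names, the threshold, the weight `lam`, and the toy counts are all assumptions made for illustration.

```python
from typing import Mapping


def backoff_count(ngram: str,
                  corpus_counts: Mapping[str, int],
                  web_counts: Mapping[str, int],
                  threshold: int = 0) -> int:
    """Return the corpus count if it exceeds `threshold`; otherwise
    fall back to the Web count (hypothetical backoff rule)."""
    corpus = corpus_counts.get(ngram, 0)
    if corpus > threshold:
        return corpus
    return web_counts.get(ngram, 0)


def interpolated_score(ngram: str,
                       corpus_counts: Mapping[str, int],
                       web_counts: Mapping[str, int],
                       corpus_total: int,
                       web_total: int,
                       lam: float = 0.5) -> float:
    """Linearly interpolate the relative frequencies from the two sources.
    `lam` weights the corpus estimate and (1 - lam) the Web estimate."""
    p_corpus = corpus_counts.get(ngram, 0) / corpus_total
    p_web = web_counts.get(ngram, 0) / web_total
    return lam * p_corpus + (1 - lam) * p_web


# Toy counts, invented purely for illustration.
corpus = {"strong tea": 3}
web = {"strong tea": 9000, "powerful tea": 400}

seen = backoff_count("strong tea", corpus, web)      # corpus count is used
unseen = backoff_count("powerful tea", corpus, web)  # falls back to the Web
score = interpolated_score("strong tea", corpus, web,
                           corpus_total=10, web_total=10000)
```

Raw counts from a corpus and from the Web are on very different scales, so a backoff scheme of this kind would normally need some normalization before counts from the two sources are compared directly; the interpolation variant sidesteps this by working with relative frequencies.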

