Web-based models for natural language processing

Mirella Lapata, Frank Keller
{"title":"Web-based models for natural language processing","authors":"Mirella Lapata, Frank Keller","doi":"10.1145/1075389.1075392","DOIUrl":null,"url":null,"abstract":"Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"213 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"201","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Trans. Speech Lang. Process.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1075389.1075392","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 201

Abstract

Previous work demonstrated that Web counts can be used to approximate bigram counts, suggesting that Web-based frequencies should be useful for a wide variety of Natural Language Processing (NLP) tasks. However, only a limited number of tasks have so far been tested using Web-scale data sets. The present article overcomes this limitation by systematically investigating the performance of Web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the Web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine Web counts and corpus counts. However, unsupervised Web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that Web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
用于自然语言处理的基于web的模型
先前的研究表明,Web计数可以用来近似双元计数,这表明基于Web的频率应该对各种自然语言处理(NLP)任务有用。然而,到目前为止,使用web规模的数据集测试的任务数量有限。本文通过系统地研究基于web的模型在几个NLP任务中的性能,克服了这一限制,包括语法和语义,生成和分析,以及比以前探索的更广泛的n-gram和词性。对于我们的大多数任务,我们发现,当从Web而不是从大型语料库获得n-gram计数时,简单的、无监督的模型表现更好。在某些情况下,通过使用结合Web计数和语料库计数的回退或插值技术,可以进一步提高性能。然而,无监督的基于web的模型通常不能胜过在较小的语料库上训练的有监督的最先进的模型。因此,我们认为基于web的模型应该被用作标准监督模型的基线,而不是替代。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信