An unsupervised language model adaptation based on keyword clustering and query availability estimation

2008 International Conference on Audio, Language and Image Processing. Pub Date: 2008-07-07. DOI: 10.1109/ICALIP.2008.4590103

A. Ito, Y. Kajiura, S. Makino, M. Suzuki

Citations: 5

Abstract

Language model (LM) adaptation using text data downloaded from the WWW is an efficient way to train a topic-specific LM. We are developing an unsupervised LM adaptation method that uses Web data. One key point of unsupervised Web-based LM adaptation is how to select the keywords that compose the search query. In this paper, we propose a new method of selecting keywords from keyword candidates, which uses a keyword clustering technique based on word similarities. The other key point is how to determine the number of pages to download for each query. We propose a method for estimating "query availability" from a small number of downloaded Web pages. The experimental results showed that determining the number of downloaded pages using the query availability was more effective than conventional methods that determine the number of pages empirically.
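
The abstract does not give the details of either step, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that word similarity is the cosine between co-occurrence count vectors, that clustering is a greedy single pass, and that "availability" can be approximated by how much of a small sample of pages overlaps a seed vocabulary. All function names, the threshold, and the proportional budget rule are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_keywords(vectors: dict[str, Counter], threshold: float) -> list[list[str]]:
    """Greedy single-pass clustering (an assumption, not the paper's algorithm):
    a word joins the first cluster whose first member is similar enough."""
    clusters: list[list[str]] = []
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def availability(sample_pages: list[str], seed_vocab: set[str]) -> float:
    """Hypothetical availability score: mean fraction of tokens in each
    sampled page that also occur in the seed vocabulary."""
    if not sample_pages:
        return 0.0
    scores = []
    for page in sample_pages:
        tokens = page.lower().split()
        hits = sum(t in seed_vocab for t in tokens)
        scores.append(hits / len(tokens) if tokens else 0.0)
    return sum(scores) / len(scores)


def allocate_pages(avail: dict[str, float], budget: int) -> dict[str, int]:
    """Split a download budget across queries in proportion to availability."""
    total = sum(avail.values())
    if total == 0:
        return {q: 0 for q in avail}
    return {q: int(budget * a / total) for q, a in avail.items()}
```

Under these assumptions, one query would be composed per keyword cluster, a handful of pages would be fetched for each query to compute its availability, and the remaining download budget would then be divided with `allocate_pages` instead of using a fixed, empirically chosen page count.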


