Estimating Embedding Vectors for Queries
Hamed Zamani, W. Bruce Croft
Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval
Published: 2016-09-12
DOI: 10.1145/2970398.2970403 (https://doi.org/10.1145/2970398.2970403)
Citations: 104
Abstract
Dense vector representations of vocabulary terms, also known as word embeddings, have been shown to be highly effective in many natural language processing tasks. Word embeddings have recently begun to be studied in a number of information retrieval (IR) tasks. One of the main steps in leveraging word embeddings for IR tasks is estimating the embedding vectors of queries. This is challenging because queries are not always available during the training phase of word embedding vectors. Previous work has used the average or sum of the embedding vectors of all query terms (AWE) to model query embedding vectors, but no theoretical justification has been presented for such a model. In this paper, we propose a theoretical framework for estimating query embedding vectors based on the individual embedding vectors of vocabulary terms. We then provide a number of different implementations of this framework and show that the AWE method is a special case of the proposed framework. We also introduce pseudo query vectors: query embedding vectors estimated using pseudo-relevant documents. We further evaluate the proposed methods extrinsically on two well-known IR tasks: query expansion and query classification. The estimated query embedding vectors are evaluated via query expansion experiments over three newswire and web TREC collections, as well as query classification experiments over the KDD Cup 2005 test set. The experiments show that the introduced pseudo query vectors significantly outperform the AWE method.
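As a minimal illustration of the AWE baseline described above, the following sketch averages the embedding vectors of the query terms. It is not the authors' implementation: the function name `average_word_embedding`, the toy three-dimensional vectors, and the choice to skip out-of-vocabulary terms are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def average_word_embedding(
    query_terms: Sequence[str],
    embeddings: Dict[str, List[float]],
) -> List[float]:
    """Return the component-wise mean of the query terms' embedding vectors.

    Terms without an embedding are skipped (an assumption of this sketch;
    the abstract does not say how out-of-vocabulary terms are handled).
    """
    vectors = [embeddings[term] for term in query_terms if term in embeddings]
    if not vectors:
        raise ValueError("no query term has an embedding")
    dim = len(vectors[0])
    # Average each dimension over the in-vocabulary query terms.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings, for illustration only.
toy_embeddings = {
    "neural": [1.0, 0.0, 2.0],
    "ranking": [3.0, 2.0, 0.0],
}

query_vector = average_word_embedding(["neural", "ranking", "oov"], toy_embeddings)
print(query_vector)
```

Replacing the division by `len(vectors)` with a plain sum would give the "sum" variant that the abstract mentions alongside the average.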