Incorporating query term dependencies in language models for document retrieval

Munirathnam Srikanth, R. Srihari
{"title":"Incorporating query term dependencies in language models for document retrieval","authors":"Munirathnam Srikanth, R. Srihari","doi":"10.1145/860435.860523","DOIUrl":null,"url":null,"abstract":"Recent advances in Information Retrieval are based on using Statistical Language Models (SLM) for representing documents and evaluating their relevance to user queries [6, 3, 4]. Language Modeling (LM) has been explored in many natural language tasks including machine translation and speech recognition [1]. In LM approach to document retrieval, each document, D, is viewed to have its own language model, MD. Given a query, Q, documents are ranked based on the probability, P (Q|MD), of their language model generating the query. While the LM approach to information retrieval has been motivated from different perspectives [3, 4], most experiments have used smoothed unigram language models that assume term independence for estimating document language models. N-gram, specifically, bigram language models that capture context provided by the previous word(s) perform better than unigram models [7]. Biterm language models [8] that ignore the word order constraint in bigram language models have been shown to perform better than bigram models. However, word order constraint cannot always be relaxed since a blind venetian is not a venetian blind. Term dependencies can be measured using their co-occurrence statistics. Nallapati and Allan [5] represent term dependencies in a sentence using a maximum spanning tree and generate a sentence tree language model for the story link detection task in TDT. Syntactic parse of user queries can provide clues for when the word order constraint can be relaxed. Syn-","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/860435.860523","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 18

Abstract

Recent advances in Information Retrieval are based on using Statistical Language Models (SLM) to represent documents and evaluate their relevance to user queries [6, 3, 4]. Language Modeling (LM) has been explored in many natural language tasks, including machine translation and speech recognition [1]. In the LM approach to document retrieval, each document, D, is viewed as having its own language model, M_D. Given a query, Q, documents are ranked by the probability, P(Q|M_D), that their language model generated the query. While the LM approach to information retrieval has been motivated from different perspectives [3, 4], most experiments have estimated document language models with smoothed unigram models that assume term independence. N-gram language models, specifically bigram models that capture the context provided by the previous word(s), perform better than unigram models [7]. Biterm language models [8], which relax the word order constraint of bigram models, have been shown to perform better still. However, the word order constraint cannot always be relaxed, since a blind venetian is not a venetian blind. Term dependencies can be measured using co-occurrence statistics. Nallapati and Allan [5] represent term dependencies in a sentence using a maximum spanning tree and generate a sentence tree language model for the story link detection task in TDT. A syntactic parse of user queries can provide clues as to when the word order constraint can be relaxed. Syn-
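As a concrete illustration of the query-likelihood ranking described above, the sketch below scores documents by log P(Q|M_D) under a Jelinek-Mercer smoothed unigram model, then adds a biterm-style variant that matches each query word pair in either order. This is a minimal sketch: the toy corpus, the smoothing weights `lam` and `mu`, and the interpolated biterm estimate are illustrative assumptions, not the estimation method of this paper or of the biterm model in [8].

```python
# A minimal sketch of query-likelihood retrieval with a Jelinek-Mercer
# smoothed unigram document model, plus an order-free "biterm" variant.
# Corpus, smoothing weights, and the biterm interpolation are assumptions.
import math
from collections import Counter

docs = {
    "d1": "the blind venetian sold a venetian blind".split(),
    "d2": "venetian glass is blown on the island of murano".split(),
}

# Collection statistics used for smoothing.
coll = Counter()
for words in docs.values():
    coll.update(words)
coll_len = sum(coll.values())

def unigram_score(query, words, lam=0.5):
    """log P(Q|M_D) under a JM-smoothed unigram model:
    P(q|M_D) = lam * P_ml(q|D) + (1 - lam) * P(q|collection)."""
    tf = Counter(words)
    score = 0.0
    for q in query:
        p_doc = tf[q] / len(words)
        p_coll = coll[q] / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

def biterm_score(query, words, lam=0.5, mu=0.4):
    """Biterm-style variant: each adjacent query pair (q1, q2) is matched
    in either order, relaxing the word-order constraint of a bigram model.
    The pair probability is interpolated with the unigram estimate."""
    pairs = Counter()
    for a, b in zip(words, words[1:]):
        pairs[frozenset((a, b))] += 1  # unordered adjacent pair
    tf = Counter(words)
    score = 0.0
    for q1, q2 in zip(query, query[1:]):
        p_pair = pairs[frozenset((q1, q2))] / max(len(words) - 1, 1)
        p_uni = lam * tf[q2] / len(words) + (1 - lam) * coll[q2] / coll_len
        score += math.log(mu * p_pair + (1 - mu) * p_uni)
    return score

query = "venetian blind".split()
ranked = sorted(docs, key=lambda d: unigram_score(query, docs[d]), reverse=True)
print("unigram ranking:", ranked)
print({d: round(biterm_score(query, docs[d]), 3) for d in docs})
```

On this toy corpus, the biterm estimate for the query pair counts both "blind venetian" and "venetian blind" in d1, where a strict bigram model would count only the in-order occurrence; this is exactly the word-order relaxation, and the blind-venetian example in the abstract shows why it cannot always be applied.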