Resilient Retrieval Models for Large Collection

Dipannita Podder
{"title":"大数据集的弹性检索模型","authors":"Dipannita Podder","doi":"10.1145/3539618.3591793","DOIUrl":null,"url":null,"abstract":"Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker should address the problems in milliseconds. One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive. Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al. [1] propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. But this model is explicitly specific to the language model. Considering the above research gaps, we focus on the following research questions in this doctoral work. Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching. Research Question 2: How can we identify the outlier terms of the large collection and penalize them in the ranking function? For RQ2, we model the distribution of maximum normalized term frequency values of relevant documents for the terms of a set of queries. 
Then we estimate the probability that the normalized frequency of a new term is coming from the right extreme of that distribution and uses this probability to penalize them in the ranking function. Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function? To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and use this score to penalize the bursty term in the ranking function. To estimate the information content of a term, we capture the contextual information of each occurrence of a term by utilizing the pre-trained BERT model and estimate the contextual divergence of the occurrence of a term from its previous occurrences.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"151 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Resilient Retrieval Models for Large Collection\",\"authors\":\"Dipannita Podder\",\"doi\":\"10.1145/3539618.3591793\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker should address the problems in milliseconds. One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive. Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al. [1] propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. 
But this model is explicitly specific to the language model. Considering the above research gaps, we focus on the following research questions in this doctoral work. Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching. Research Question 2: How can we identify the outlier terms of the large collection and penalize them in the ranking function? For RQ2, we model the distribution of maximum normalized term frequency values of relevant documents for the terms of a set of queries. Then we estimate the probability that the normalized frequency of a new term is coming from the right extreme of that distribution and uses this probability to penalize them in the ranking function. Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function? To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and use this score to penalize the bursty term in the ranking function. To estimate the information content of a term, we capture the contextual information of each occurrence of a term by utilizing the pre-trained BERT model and estimate the contextual divergence of the occurrence of a term from its previous occurrences.\",\"PeriodicalId\":425056,\"journal\":{\"name\":\"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"volume\":\"151 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539618.3591793\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539618.3591793","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Modern search engines employ multi-stage ranking pipelines to balance retrieval efficiency and effectiveness over large collections. These pipelines first retrieve an initial set of candidate documents from the large repository with a cost-effective retrieval model (such as BM25 or a language model) and then re-rank the candidates with neural retrieval models. Such pipelines perform well only if the first-stage ranker achieves high recall [2], and the first-stage ranker must do so within milliseconds. One of the major problems a search engine faces is the presence of extraneous terms in the query. Since query-document term matching is the fundamental building block of any retrieval model, retrieval effectiveness drops when documents are matched against these extraneous query terms. Existing models [4, 5] address this issue by estimating term weights, either with supervised approaches or by exploiting information from an initial set of top-ranked documents, and incorporating those weights into the final ranking function. Although the latter category of methods is unsupervised, they are inefficient because ranking the large collection to obtain the initial top-ranked documents is computationally expensive.

Besides, in real-world collections a term may appear many times in a document for several reasons: it may be used in different contexts, the author may use it in bursts, or its frequency may simply be an outlier. Existing retrieval models therefore overestimate the relevance score of irrelevant documents that contain a query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduces the contribution of such high-frequency term occurrences to the relevance score. However, its truncation-point selection does not leverage term-specific distribution information: it treats all relevant documents of a set of queries as a single bag, which does not capture the distribution of individual terms well. Furthermore, the model does not capture term burstiness; it only reduces the effect of outliers. Cummins et al. [1] propose a language model based on the Dirichlet compound multinomial distribution that can capture term burstiness, but this solution is specific to the language-modelling framework. Considering these research gaps, this doctoral work focuses on the following research questions.

Research Question 1: How can we identify the central query terms of a verbose query without relying on an initial ranked list or relevance judgments, and modify the ranking function so that it focuses on the derived central terms? To address RQ1, we generate contextual vectors of the entire query and of the individual query terms using the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model and then analyze their correlation to estimate a term centrality score, so that the ranking function can focus on the central terms during term matching.

Research Question 2: How can we identify outlier terms in the large collection and penalize them in the ranking function? For RQ2, we model the distribution of the maximum normalized term frequency values observed in the relevant documents for the terms of a set of queries. We then estimate the probability that the normalized frequency of a new term comes from the right extreme of that distribution and use this probability to penalize such terms in the ranking function.
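To make the RQ1 idea concrete, here is a minimal Python sketch of term-centrality scoring. It assumes cosine similarity between mean-pooled BERT vectors as the "correlation" measure and, for simplicity, encodes each term in isolation; the thesis itself may use in-context term vectors and a different statistic, so treat this purely as an illustration.

```python
# Hypothetical sketch (not the thesis's exact method): score each query term by the
# cosine similarity between its mean-pooled BERT vector and the vector of the whole
# query. Higher scores mark terms that are more "central" to the query.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-layer BERT vector for a piece of text (drops [CLS]/[SEP])."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
    return hidden[1:-1].mean(dim=0)

def term_centrality(query: str) -> dict[str, float]:
    """Centrality score of every whitespace-separated query term."""
    query_vec = embed(query)
    return {
        term: float(torch.nn.functional.cosine_similarity(embed(term), query_vec, dim=0))
        for term in query.split()
    }

weights = term_centrality("documents describing the side effects of aspirin in children")
# The weights could, for instance, rescale each term's BM25 contribution:
# score(d, q) = sum over t in q of weights[t] * BM25(t, d)
print(weights)
```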
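Similarly, the following sketch illustrates the RQ2 outlier penalty under assumed choices: a log-normal fit to the maximum normalized term frequencies of relevant documents, and multiplicative damping of the term's matching score. Both the distribution family and the damping form are assumptions made for illustration, not details stated in the abstract.

```python
# Illustrative sketch of the RQ2 outlier penalty. A term whose normalized frequency
# sits in the right extreme of the fitted distribution gets its score damped.
import numpy as np
from scipy import stats

# Toy training signal: for each term of a set of training queries, the maximum
# normalized term frequency (tf / document length) seen in its relevant documents.
max_norm_tfs = np.array([0.008, 0.010, 0.012, 0.015, 0.018, 0.020, 0.025, 0.030])

shape, loc, scale = stats.lognorm.fit(max_norm_tfs, floc=0.0)

def outlier_probability(tf: int, doc_len: int) -> float:
    """How far into the right extreme the observation sits:
    P(max normalized tf of a relevant document <= observed normalized tf)."""
    return float(stats.lognorm.cdf(tf / doc_len, shape, loc=loc, scale=scale))

def damped_term_score(base_score: float, tf: int, doc_len: int) -> float:
    """Shrink a term's matching score when its frequency looks like an outlier."""
    return base_score * (1.0 - outlier_probability(tf, doc_len))

# 40 occurrences in a 500-word document are damped heavily; 5 occurrences much less so.
print(damped_term_score(2.1, tf=40, doc_len=500))
print(damped_term_score(2.1, tf=5, doc_len=500))
```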
Research Question 3: How can we detect bursty terms and incorporate them into the ranking function? To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and uses this score to penalize bursty terms in the ranking function. To estimate a term's information content, we capture the contextual information of each occurrence of the term with the pre-trained BERT model and estimate the contextual divergence of each occurrence from its previous occurrences.
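As a rough illustration of the RQ3 intuition, the sketch below compares each occurrence of a term with its earlier occurrences in the same document: when a new occurrence is nearly identical in contextual-embedding space, it adds little information, so the usage looks bursty. The cosine-similarity redundancy measure and the single-word-piece matching are simplifying assumptions.

```python
# Rough sketch of the RQ3 burstiness signal (assumptions: cosine similarity as the
# redundancy measure; the term must be a single word-piece in BERT's vocabulary).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def burstiness_score(document: str, term: str) -> float:
    """Average similarity of each occurrence of `term` to the mean of its earlier
    occurrences in the document; higher values mean more redundant (bursty) usage."""
    enc = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]                  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    occurrences = [hidden[i] for i, tok in enumerate(tokens) if tok == term]
    if len(occurrences) < 2:
        return 0.0                                                  # no burst possible
    redundancies = []
    for i in range(1, len(occurrences)):
        history = torch.stack(occurrences[:i]).mean(dim=0)          # earlier occurrences
        sim = torch.nn.functional.cosine_similarity(occurrences[i], history, dim=0)
        redundancies.append(float(sim))                             # high sim = little new information
    return sum(redundancies) / len(redundancies)

doc = ("the bank raised interest rates today . the bank also reported record profits , "
       "while we walked along the river bank at sunset .")
print(burstiness_score(doc, "bank"))
```

In a full model, a score like this would then down-weight the bursty term's frequency contribution inside the ranking function.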