{"title":"大数据集的弹性检索模型","authors":"Dipannita Podder","doi":"10.1145/3539618.3591793","DOIUrl":null,"url":null,"abstract":"Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker should address the problems in milliseconds. One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive. Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al. [1] propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. But this model is explicitly specific to the language model. Considering the above research gaps, we focus on the following research questions in this doctoral work. Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching. Research Question 2: How can we identify the outlier terms of the large collection and penalize them in the ranking function? For RQ2, we model the distribution of maximum normalized term frequency values of relevant documents for the terms of a set of queries. 
Then we estimate the probability that the normalized frequency of a new term is coming from the right extreme of that distribution and uses this probability to penalize them in the ranking function. Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function? To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and use this score to penalize the bursty term in the ranking function. To estimate the information content of a term, we capture the contextual information of each occurrence of a term by utilizing the pre-trained BERT model and estimate the contextual divergence of the occurrence of a term from its previous occurrences.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"151 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Resilient Retrieval Models for Large Collection\",\"authors\":\"Dipannita Podder\",\"doi\":\"10.1145/3539618.3591793\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall [2]. To achieve this, the first-stage ranker should address the problems in milliseconds. One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive. Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al. [1] propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. 
But this model is explicitly specific to the language model. Considering the above research gaps, we focus on the following research questions in this doctoral work. Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching. Research Question 2: How can we identify the outlier terms of the large collection and penalize them in the ranking function? For RQ2, we model the distribution of maximum normalized term frequency values of relevant documents for the terms of a set of queries. Then we estimate the probability that the normalized frequency of a new term is coming from the right extreme of that distribution and uses this probability to penalize them in the ranking function. Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function? To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and use this score to penalize the bursty term in the ranking function. To estimate the information content of a term, we capture the contextual information of each occurrence of a term by utilizing the pre-trained BERT model and estimate the contextual divergence of the occurrence of a term from its previous occurrences.\",\"PeriodicalId\":425056,\"journal\":{\"name\":\"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"volume\":\"151 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539618.3591793\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539618.3591793","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Modern search engines employ multi-stage ranking pipelines to balance retrieval efficiency and effectiveness over large collections. These pipelines retrieve an initial set of candidate documents from the large repository with a cost-effective retrieval model (such as BM25 or a language model) and then re-rank these candidates with neural retrieval models. Such pipelines perform well only if the first-stage ranker achieves high recall [2], and to achieve this the first-stage ranker must respond within milliseconds.

One of the major problems for a search engine is the presence of extraneous terms in the query. Since query-document term matching is the fundamental building block of any retrieval model, retrieval effectiveness drops when documents are matched against these extraneous query terms. Existing models [4, 5] address this issue by estimating term weights, either with supervised approaches or by exploiting the information in a set of initial top-ranked documents, and incorporating the weights into the final ranking function. Although the latter category of methods is unsupervised, it is inefficient, because ranking the large collection to obtain the initial top-ranked documents is computationally expensive. Moreover, in a real-world collection some terms appear many times in a document for several reasons: a term may occur in different contexts, the author may use it in bursts, or the occurrence may simply be an outlier. Existing retrieval models therefore overestimate the relevance score of irrelevant documents that contain a query term with extremely high frequency. Paik et al. [3] propose a probabilistic model based on truncated distributions that reduces the contribution of such high-frequency term occurrences to the relevance score. However, its truncation-point selection does not leverage term-specific distribution information: it treats all relevant documents for a set of queries as a single bag, which is a poor way to capture the distribution of terms. Furthermore, this model does not capture term burstiness; it only reduces the effect of outliers. Cummins et al. [1] propose a language model based on the Dirichlet compound multinomial distribution that can capture term burstiness, but that approach is specific to the language-modeling framework. Considering these research gaps, this doctoral work focuses on the following research questions.

Research Question 1: How can we identify the central query terms of a verbose query, without relying on an initial ranked list or relevance judgments, and modify the ranking function so that it focuses on the derived central terms? To address RQ1, we generate contextual vectors for the entire query and for its individual terms using the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, and subsequently analyze their correlation to estimate a term-centrality score so that the ranking function can focus on the central terms during term matching.

Research Question 2: How can we identify the outlier terms of a large collection and penalize them in the ranking function? For RQ2, we model the distribution of the maximum normalized term frequency values observed in relevant documents for the terms of a set of queries. We then estimate the probability that the normalized frequency of a new term comes from the right extreme of that distribution and use this probability to penalize such terms in the ranking function.
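A minimal sketch of the RQ2 idea follows, under assumptions: the maximum normalized term frequencies from relevant training documents are summarized by a fitted distribution, and a new occurrence is damped in proportion to how deep it lies in the right tail. The log-normal choice, the multiplicative damping, and all function names are illustrative assumptions, not the exact formulation of the proposed model.

```python
# Illustrative sketch (assumptions, not the exact proposed model): fit a
# distribution to maximum normalized term frequencies seen in relevant
# documents for training queries, then shrink the term-frequency signal of a
# new occurrence according to how far it sits in the right tail.
import numpy as np
from scipy import stats


def fit_tail_model(max_norm_tfs):
    """Fit a log-normal (an assumed choice) to max normalized term frequencies."""
    shape, loc, scale = stats.lognorm.fit(np.asarray(max_norm_tfs), floc=0.0)
    return shape, loc, scale


def extremeness(norm_tf, model):
    """P(X <= norm_tf): how deep into the right tail this value lies."""
    shape, loc, scale = model
    return stats.lognorm.cdf(norm_tf, shape, loc=loc, scale=scale)


def damped_tf(norm_tf, model):
    """Down-weight the term-frequency contribution of likely-outlier occurrences."""
    return norm_tf * (1.0 - extremeness(norm_tf, model))


if __name__ == "__main__":
    # Toy data: typical max normalized tf values; 0.30 plays the outlier.
    train = [0.01, 0.02, 0.015, 0.03, 0.025, 0.02, 0.04]
    model = fit_tail_model(train)
    for tf in (0.02, 0.30):
        print(f"norm_tf={tf:.2f} -> damped={damped_tf(tf, model):.4f}")
```

In this sketch an ordinary occurrence keeps most of its weight, while a frequency far beyond the training distribution is driven toward zero; how the damped value enters the final ranking function is left open here.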
Research Question 3: How can we detect bursty terms and incorporate them into the ranking function? To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and uses this score to penalize bursty terms in the ranking function. To estimate a term's information content, we capture the contextual information of each occurrence of the term using the pre-trained BERT model and estimate the contextual divergence of an occurrence from its previous occurrences.
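A minimal sketch of the RQ3 burstiness signal, under assumptions: each occurrence of a term gets a contextual embedding from a pre-trained BERT model, and an occurrence that is very similar to its previous occurrences adds little new contextual information and is treated as bursty. The `bert-base-uncased` checkpoint, the cosine-similarity divergence, the centroid of previous occurrences, and the averaging are illustrative assumptions rather than the proposed model's exact formulation.

```python
# Illustrative sketch (assumptions, not the exact proposed model): score a
# term's burstiness in a document by how little new contextual information
# each repeated occurrence adds, using token embeddings from pre-trained BERT.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()


def occurrence_embeddings(document: str, term: str):
    """Contextual embedding of every occurrence of `term` in `document`.
    For simplicity this only handles terms that map to a single wordpiece."""
    enc = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    term_id = tokenizer.convert_tokens_to_ids(term)
    positions = (enc["input_ids"][0] == term_id).nonzero().flatten()
    return [hidden[p] for p in positions]


def burstiness_score(document: str, term: str) -> float:
    """Average similarity of each occurrence to the centroid of its previous
    occurrences: high similarity = low contextual divergence = more bursty."""
    embs = occurrence_embeddings(document, term)
    if len(embs) < 2:
        return 0.0
    sims = []
    for i in range(1, len(embs)):
        centroid = torch.stack(embs[:i]).mean(dim=0)
        sims.append(torch.cosine_similarity(embs[i], centroid, dim=0).item())
    return sum(sims) / len(sims)
```

Under this reading, a term repeated in near-identical contexts receives a high burstiness score and would be penalized in the ranking function, whereas repeated occurrences in genuinely different contexts keep their full weight.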