A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution
Jiaul H. Paik
Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publication date: 2015-08-09
DOI: 10.1145/2766462.2767762 (https://doi.org/10.1145/2766462.2767762)
Citation count: 9
Abstract
The main goal of a retrieval model is to measure the degree of relevance of a document with respect to a given query. Probabilistic models are widely used to measure the likelihood of relevance of a document by combining within-document term frequency and term specificity in a formal way. Recent research shows that tf normalization that factors in multiple aspects of term salience is an effective scheme. However, existing models do not fully utilize these tf normalization components in a principled way. Moreover, most state-of-the-art models ignore the distribution of a term in the part of the collection that contains the term. In this article, we introduce a new probabilistic ranking model that addresses these issues. We start from the assumption that the relevance of a document increases with the frequency of the query term, and use it to measure the likelihood that the normalized frequency of a term in a particular document is the maximum with respect to its distribution in the elite set (the documents that contain the term). Thus, the weight of a term in a document is proportional to the probability that the normalized frequency of that term is the maximum, under the hypothesis that the frequencies are generated randomly. To that end, we introduce a ranking function based on the maximum value distribution that uses two aspects of tf normalization. The merit of the proposed model is demonstrated on a number of recent large web collections. Results show that the proposed model outperforms state-of-the-art models by a significantly large margin.
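
The abstract does not state the ranking function itself, so the following is only a minimal sketch of the general idea behind a maximum value distribution, not the paper's formula. It relies on a standard order-statistics fact: if n values are drawn independently from a distribution with CDF F, the probability that their maximum is at most x is F(x)^n. The empirical CDF, the sample data `elite_tf`, the parameter `n`, and both function names are assumptions made for illustration.

```python
from bisect import bisect_right


def empirical_cdf(sorted_values, x):
    """Fraction of the (ascending) sorted_values that are <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def prob_max_at_most(sorted_values, x, n):
    """P(max of n i.i.d. draws <= x) = F(x) ** n, with F the empirical CDF.

    Illustrative only: the paper's actual weighting function is not
    given in the abstract and may differ substantially.
    """
    return empirical_cdf(sorted_values, x) ** n


# Hypothetical normalized frequencies of one query term across its elite set
# (made-up numbers, not data from the paper).
elite_tf = sorted([0.2, 0.5, 0.5, 0.9, 1.4, 2.0, 3.1, 4.0])

for x in (0.5, 2.0, 4.0):
    # A larger normalized frequency yields a larger score, matching the
    # assumption that relevance grows with term frequency.
    print(x, prob_max_at_most(elite_tf, x, n=3))
```

In this sketch the score is monotonically non-decreasing in the normalized frequency and saturates at 1 for the largest observed value, which is the qualitative behavior the abstract describes. How the two tf normalization aspects are combined is not specified in the abstract and is left out here.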