{"title":"StatBM25","authors":"Xing Tan, Fanghong Jiang, J. Huang","doi":"10.1145/3234944.3234975","DOIUrl":null,"url":null,"abstract":"In Information Retrieval and Web Search, BM25 is one of the most influential probabilistic retrieval formulas for document weighting and ranking. BM25 involves three parameters $k_1$, $k_3$ and b, which provide scalar approximation and scaling of important document features such asterm frequency, document frequency, anddocument length. We investigate in this paper aggregative and statistical document features for document ranking. Shortly speaking, a statistically adjusted BM25 is used to score in an aggregative way onvirtual documents, which are generated by randomly combining documents from the original collection. The problem size, in the number of virtual documents to be ranked, is an expansion to the problem size of the original problem. As a result, ranking is actually realized through performing statistical sampling. Rejection Sampling, a simple Monte Carlo sampling method is used at present. This new framework is called StatBM25, in emphasizing first the fact that the original problem domain space is K-expanded (a concept to be further explained in the paper); Further, statistical sampling is employed in the model. Empirical studies are performed on several standard test collections, where StatBM25 demonstrates convincingly high degree of both uniqueness and effectiveness compared to BM25. This means, in our belief, that StatBM25 as a statistically smoothed and normalized variant to BM25, might eventually lead to discoveries of useful new statistic measures for document ranking.","PeriodicalId":193631,"journal":{"name":"Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3234944.3234975","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In Information Retrieval and Web Search, BM25 is one of the most influential probabilistic retrieval formulas for document weighting and ranking. BM25 involves three parameters $k_1$, $k_3$ and b, which provide scalar approximation and scaling of important document features such asterm frequency, document frequency, anddocument length. We investigate in this paper aggregative and statistical document features for document ranking. Shortly speaking, a statistically adjusted BM25 is used to score in an aggregative way onvirtual documents, which are generated by randomly combining documents from the original collection. The problem size, in the number of virtual documents to be ranked, is an expansion to the problem size of the original problem. As a result, ranking is actually realized through performing statistical sampling. Rejection Sampling, a simple Monte Carlo sampling method is used at present. This new framework is called StatBM25, in emphasizing first the fact that the original problem domain space is K-expanded (a concept to be further explained in the paper); Further, statistical sampling is employed in the model. Empirical studies are performed on several standard test collections, where StatBM25 demonstrates convincingly high degree of both uniqueness and effectiveness compared to BM25. This means, in our belief, that StatBM25 as a statistically smoothed and normalized variant to BM25, might eventually lead to discoveries of useful new statistic measures for document ranking.