Two-level dynamic index pruning
Jan Friedrich, C. Lindemann, Michael Petrifke
2017 Twelfth International Conference on Digital Information Management (ICDIM), published 2017-09-01
DOI: 10.1109/ICDIM.2017.8244656 (https://doi.org/10.1109/ICDIM.2017.8244656)
Abstract
In this paper, we propose two-level dynamic index pruning to improve retrieval efficiency without degrading the quality of query results. Analyzing the ClueWeb09 data set, we observe that most terms appear in thousands of different websites, while internet search engines typically display only the top-10 search results. We conclude that retrieval efficiency would improve substantially if one could prune entire websites upon knowing that none of their web pages can score high enough to enter the top 10 for the query. To this end, two-level dynamic index pruning uses a hierarchical document numbering scheme that subdivides each posting list into sorted runs, one per website, rather than treating it as a flat inverted index over all web pages. Experimental results on the ClueWeb09 data set illustrate the benefits of two-level dynamic index pruning for retrieval efficiency.
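
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical in-memory index in which every document id is a pair (site_id, page_id), each posting list is grouped into one run per website, and each run stores the maximum term score of any page in that site. Summing those maxima over the query terms gives an upper bound for the whole site; once k results are held, a site whose bound cannot exceed the current k-th best score is skipped without touching its pages. The names `top_k`, `index`, and the additive scoring are illustrative assumptions.

```python
import heapq
from collections import defaultdict

# Hypothetical index layout (an assumption, not taken from the paper):
#   term -> {site_id: (site_max_score, {page_id: score})}
# Each inner dict entry is the "run" of one website within a posting list.

def top_k(index, query_terms, k=10):
    """Return (results, pruned_sites) where results holds the k best
    (score, (site_id, page_id)) pairs in descending order."""
    lists = [index[t] for t in query_terms if t in index]
    sites = sorted(set().union(*(pl.keys() for pl in lists))) if lists else []
    heap = []          # min-heap; heap[0][0] is the current k-th best score
    pruned_sites = 0
    for site in sites:
        # Level 1: upper bound on the score of any page of this site.
        bound = sum(pl[site][0] for pl in lists if site in pl)
        if len(heap) == k and bound <= heap[0][0]:
            pruned_sites += 1   # no page of this site can enter the top k
            continue
        # Level 2: score the individual pages of the surviving site.
        scores = defaultdict(float)
        for pl in lists:
            if site in pl:
                for page, s in pl[site][1].items():
                    scores[page] += s
        for page, s in scores.items():
            if len(heap) < k:
                heapq.heappush(heap, (s, (site, page)))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, (site, page)))
    return sorted(heap, reverse=True), pruned_sites

# Toy data: three sites, two query terms.
index = {
    "a": {0: (3.0, {0: 3.0, 1: 1.0}), 1: (0.5, {0: 0.5}), 2: (4.0, {5: 4.0})},
    "b": {0: (2.0, {0: 2.0}), 1: (0.4, {1: 0.4}), 2: (1.0, {5: 1.0, 6: 0.2})},
}
results, pruned = top_k(index, ["a", "b"], k=2)
```

In this toy run, site 1 is skipped as a whole because its bound (0.5 + 0.4) cannot beat the second-best score already held. Because a pruned site could not have contributed a page that displaces a held result, the pruning in this sketch leaves the top-k result unchanged, which matches the abstract's goal of gaining efficiency without degrading result quality.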