Reordering an index to speed query processing without loss of effectiveness

Australasian Document Computing Symposium Pub Date : 2012-12-05 DOI:10.1145/2407085.2407088

D. Hawking, Timothy Jones


Citations: 9

Abstract

Following Long and Suel, we empirically investigate the importance of document order in search engines which rank documents using a combination of dynamic (query-dependent) and static (query-independent) scores, and use document-at-a-time (DAAT) processing. When inverted file postings are in collection order, assigning document numbers in order of descending static score supports lossless early termination while maintaining good compression.

Since static scores may not be available until all documents have been gathered and indexed, we build a tool for reordering an existing index and show that it operates in less than 20% of the original indexing time. We note that this additional cost is easily recouped by savings at query processing time. We compare best early-termination points for several different index orders on three enterprise search collections (a whole-of-government index with two very different query sets, and a collection from a UK university). We also present results for the same orders for ClueWeb09-CatB. Our evaluation focuses on finding results likely to be clicked on by users of Web or website search engines, that is, Nav and Key results in the TREC 2011 Web Track judging scheme.

The orderings tested are Original, Reverse, Random, and QIE (descending order of static score). For three enterprise search test sets we find that QIE order can achieve close-to-maximal search effectiveness with much lower computational cost than for other orderings. Additionally, reordering has negligible impact on compressed index size for indexes that contain position information. Our results for an artificial query set against the TREC ClueWeb09 Category B collection are much more equivocal and we canvass possible explanations for future investigation.
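The idea of numbering documents by descending static score and stopping DAAT scoring early can be sketched as follows. This is a minimal illustration, not the authors' reordering tool: the linear score combination, the weight `alpha`, the bound `max_dynamic`, and all function names are assumptions made for the example.

```python
import heapq

def reorder_by_static_score(static_scores):
    """Map old doc ids to new ids so that higher static scores get smaller ids."""
    order = sorted(range(len(static_scores)), key=lambda d: -static_scores[d])
    old_to_new = [0] * len(static_scores)
    for new_id, old_id in enumerate(order):
        old_to_new[old_id] = new_id
    return old_to_new

def remap_postings(postings, old_to_new):
    """Rewrite each postings list with the new ids, kept in ascending order."""
    return {term: sorted(old_to_new[d] for d in docs)
            for term, docs in postings.items()}

def top_k_daat(candidates, static, dynamic, k, alpha=0.5, max_dynamic=1.0):
    """Score candidates in ascending doc-id order and stop once no later
    document can enter the top k.

    Assumes `static` is indexed by new doc id and is non-increasing, and that
    dynamic(doc) never exceeds max_dynamic (both hypothetical conventions).
    Returns (top-k list of (score, doc), number of documents scored).
    """
    heap = []    # min-heap holding the current top k as (score, doc)
    scored = 0
    for doc in candidates:
        # Best score this document, or any later one, could possibly reach.
        upper_bound = alpha * static[doc] + (1 - alpha) * max_dynamic
        if len(heap) == k and heap[0][0] >= upper_bound:
            break    # the top k cannot change any more
        score = alpha * static[doc] + (1 - alpha) * dynamic(doc)
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), scored
```

Because static scores are non-increasing in document number, the upper bound can only fall as processing advances, so stopping at the first document that cannot beat the current k-th score returns the same top k as exhaustive scoring under these assumptions.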
