Permutation indexing: fast approximate retrieval from large corpora

M. Gurevich, Tamás Sarlós
{"title":"Permutation indexing: fast approximate retrieval from large corpora","authors":"M. Gurevich, Tamás Sarlós","doi":"10.1145/2505515.2505646","DOIUrl":null,"url":null,"abstract":"Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2505515.2505646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
排列索引:从大型语料库快速近似检索
倒排索引在包括网络搜索在内的检索系统中是一种普遍使用的技术。尽管它很受欢迎,但它有一个缺点,即查询检索时间是高度可变的,并且随着语料库的大小而增长。在这项工作中,我们提出了一种替代技术,排列索引,其中检索成本是严格限定的,并且仅对语料库大小有对数依赖。我们的方法基于两种新技术:(a)将术语空间划分为查询中经常出现的重叠术语簇,以及(b)将集群中由术语组成的所有查询的结果紧凑编码为连续的文档id序列的数据结构。然后,通过获取这些序列的一小部分来检索查询结果。这是有代价的:我们的编码是有损的,因此返回近似的结果集。返回的真实结果的比例,召回率,是由冗余级别控制的。为排列索引分配的空间越多,召回率就越高。我们在理论上分析了简化文档和查询模型下的排列索引,并在实际文档和查询集合上进行了经验分析。我们表明,虽然排列索引不能取代传统的检索方法,因为不能保证对所有查询都有高召回率,但它覆盖了高达77%的尾部查询,可以用来加快这些查询的检索速度。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信