设置超越MinHash的相似性搜索

Tobias Christiani, R. Pagh
{"title":"设置超越MinHash的相似性搜索","authors":"Tobias Christiani, R. Pagh","doi":"10.1145/3055399.3055443","DOIUrl":null,"url":null,"abstract":"We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":"{\"title\":\"Set similarity search beyond MinHash\",\"authors\":\"Tobias Christiani, R. Pagh\",\"doi\":\"10.1145/3055399.3055443\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).\",\"PeriodicalId\":20615,\"journal\":{\"name\":\"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-12-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"41\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3055399.3055443\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3055399.3055443","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 41

摘要

考虑Braun-Blanquet相似度B(x, y) = |x∩y| / max(|x|, |y|)下的近似集相似度搜索问题。(b1, b2)近似布朗-布兰凯相似搜索问题是对集合P进行预处理,使得给定一个查询集q,如果存在x Ε P且B(q, x)≥b1,则我们可以有效地返回x ' Ε P且B(q, x ') > b2。我们提出了一个简单的数据结构来解决这个问题,它的空间使用为O(n1+ρlogn +∑x ε P|x|),查询时间为O(|q|nρ logn),其中n = |P|, ρ = log(1/b1)/log(1/b2)。利用O'Donnell等人(TOCT 2014)对位置敏感哈希的现有下界,我们表明ρ的这个值在参数空间上是紧的,即对于常数0 < b2 < b1 < 1的每一个选择。在所有集合具有相同大小的情况下,我们的解决方案严格改进了ρ值,ρ值可以通过使用最先进的数据独立技术在Indyk-Motwani位置敏感散列框架(STOC 1998)中获得,例如Broder的MinHash (CCS 1997)用于Jaccard相似性和Andoni等人的交叉多面体LSH (NIPS 2015)用于余弦相似性。令人惊讶的是,尽管我们的解决方案是数据独立的,但在很大一部分参数空间中,我们的性能优于目前最好的由Andoni和Razenshteyn (STOC 2015)提出的数据依赖方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Set similarity search beyond MinHash
We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信