Set similarity search beyond MinHash

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
Pub Date: 2016-12-22
DOI: 10.1145/3055399.3055443

Tobias Christiani, R. Pagh


Citations: 41

Abstract

We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(x, y) = |x \cap y| / \max(|x|, |y|)$. The $(b_1, b_2)$-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $q$, if there exists $x \in P$ with $B(q, x) \geq b_1$, then we can efficiently return $x' \in P$ with $B(q, x') > b_2$. We present a simple data structure that solves this problem with space usage $O(n^{1+\rho} \log n + \sum_{x \in P} |x|)$ and query time $O(|q| n^{\rho} \log n)$, where $n = |P|$ and $\rho = \log(1/b_1)/\log(1/b_2)$. Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014), we show that this value of $\rho$ is tight across the parameter space, i.e., for every choice of constants $0 < b_2 < b_1 < 1$. In the case where all sets have the same size, our solution strictly improves upon the value of $\rho$ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
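As a rough illustration of the quantities defined in the abstract, the following Python sketch computes Braun-Blanquet similarity and the exponent ρ = log(1/b1)/log(1/b2), and compares it with the exponent MinHash would give for equal-sized sets. It is not the authors' data structure. The function names are hypothetical, the empty-set convention is an assumption, and the comparison relies on the standard identity J = B/(2 − B) between Jaccard and Braun-Blanquet similarity when |x| = |y|, together with the usual MinHash exponent log(1/j1)/log(1/j2).

```python
import math


def braun_blanquet(x: set, y: set) -> float:
    """B(x, y) = |x ∩ y| / max(|x|, |y|).

    Returning 0.0 when both sets are empty is an assumption made here;
    the abstract does not specify that case.
    """
    largest = max(len(x), len(y))
    if largest == 0:
        return 0.0
    return len(x & y) / largest


def rho_braun_blanquet(b1: float, b2: float) -> float:
    """Exponent rho = log(1/b1) / log(1/b2) stated in the abstract."""
    if not 0 < b2 < b1 < 1:
        raise ValueError("requires 0 < b2 < b1 < 1")
    return math.log(1 / b1) / math.log(1 / b2)


def jaccard_from_bb_equal_size(b: float) -> float:
    """For |x| = |y|, Jaccard similarity J equals B / (2 - B)."""
    return b / (2 - b)


def rho_minhash_equal_size(b1: float, b2: float) -> float:
    """MinHash LSH exponent log(1/j1) / log(1/j2) for equal-sized sets,
    with thresholds converted from Braun-Blanquet to Jaccard."""
    if not 0 < b2 < b1 < 1:
        raise ValueError("requires 0 < b2 < b1 < 1")
    j1 = jaccard_from_bb_equal_size(b1)
    j2 = jaccard_from_bb_equal_size(b2)
    return math.log(1 / j1) / math.log(1 / j2)


if __name__ == "__main__":
    print(braun_blanquet({1, 2, 3}, {2, 3, 4, 5}))  # 2 shared / max(3, 4)
    b1, b2 = 0.5, 0.25
    print(rho_braun_blanquet(b1, b2))      # exponent from the abstract
    print(rho_minhash_equal_size(b1, b2))  # MinHash exponent, equal sizes
```

For b1 = 0.5 and b2 = 0.25, the first exponent is log 2 / log 4 = 0.5, while the MinHash exponent is log 3 / log 7, which is larger. This matches the abstract's claim of a strict improvement in the equal-size case for this one parameter choice.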

