Fast Similarity Sketching
Søren Dahlgaard, M. B. T. Knudsen, M. Thorup
2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)
DOI: 10.1109/FOCS.2017.67
Published: 2017-04-14
Citations: 28
Abstract
We consider the Similarity Sketching problem: Given a universe [u] = {0, ..., u-1}, we want a random function S mapping subsets A of [u] into vectors S(A) of size t, such that similarity is preserved. More precisely: Given subsets A, B of [u], define X_i = [S(A)[i] = S(B)[i]] and X = sum_{i in [t]} X_i. We want E[X] = t*J(A,B), where J(A,B) = |A intersect B| / |A union B|, and furthermore strong concentration guarantees (i.e., Chernoff-style bounds) for X. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc., via the classic MinHash algorithm. The vectors S(A) are also called sketches. The seminal t×MinHash algorithm uses t random hash functions h_1, ..., h_t, and stores (min_{a in A} h_1(a), ..., min_{a in A} h_t(a)) as the sketch of A. The main drawback of MinHash is, however, its O(t*|A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS12] introduced one permutation hashing (OPH), which creates a sketch of size t in O(t + |A|) time, but with the drawback that some of the t entries may be empty when |A| = O(t). One could argue that sketching is not necessary in this case; however, the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers, initiated by Shrivastava and Li [ICML14]. However, these densification schemes fail to provide good concentration bounds exactly in the case |A| = O(t), where they are needed. In this paper we present a new sketch which obtains essentially the best of both worlds: a fast O(t log t + |A|) expected running time while getting the same strong concentration bounds as MinHash. Our new sketch can be seen as a mix between sampling with replacement and sampling without replacement.
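To make the t×MinHash scheme described above concrete, here is a minimal Python sketch. It is not the paper's construction: the t "random hash functions" are modeled, as is common in practice, by random linear functions (a*x + b) mod p for a large Mersenne prime p, and the Jaccard estimate is X/t, the fraction of agreeing coordinates.

```python
import random

def make_minhash(t, seed=0):
    """Return a sketching function S with t coordinates (t x MinHash).

    Each h_i is modeled as a random linear function x -> (a_i*x + b_i) mod p,
    a standard stand-in for a truly random hash function.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1  # Mersenne prime, larger than any realistic universe
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(t)]

    def sketch(A):
        # S(A)[i] = min_{a in A} h_i(a); note the t passes over A,
        # which is exactly the O(t*|A|) running time of classic MinHash.
        return [min((a_i * x + b_i) % p for x in A) for a_i, b_i in coeffs]

    return sketch

def estimate_jaccard(sA, sB):
    # X/t, where X = sum_i [S(A)[i] = S(B)[i]]; E[X/t] = J(A, B).
    return sum(x == y for x, y in zip(sA, sB)) / len(sA)
```

For example, with A = {0, ..., 99} and B = {50, ..., 149}, the true Jaccard similarity is 50/150 = 1/3, and with t = 200 coordinates the estimate concentrates around that value.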
We demonstrate the power of our new sketch by considering popular applications in large-scale classification with linear SVM, as introduced by Li et al. [NIPS11], as well as approximate similarity search using the LSH framework of Indyk and Motwani [STOC98]. In particular, for the (j_1, j_2)-approximate similarity search problem on a collection C of n sets, we obtain a data structure with space usage O(n^{1+rho} + sum_{A in C} |A|) and O(n^rho * log n + |Q|) expected time for querying a set Q, compared to the O(n^rho * log n * |Q|) expected query time of the classic result of Indyk and Motwani.
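The one permutation hashing idea discussed in the abstract can also be sketched briefly. Again this is an illustration, not the authors' code: a single linear hash function assigns each element to one of t bins, and each bin keeps the minimum hash value it sees, giving the O(t + |A|) running time but leaving bins empty when |A| = O(t).

```python
import random

def oph_sketch(A, t, seed=0):
    """One permutation hashing: one hash per element, O(t + |A|) time.

    Each element is hashed once; its hash value selects a bin, and each
    bin stores the minimum hash landing in it. Bins that receive no
    element stay None -- the empty entries that densification schemes
    (Shrivastava and Li [ICML14] and follow-ups) must fill in.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    bins = [None] * t
    for x in A:
        h = (a * x + b) % p
        i = h % t  # the bin this element falls into
        if bins[i] is None or h < bins[i]:
            bins[i] = h
    return bins
```

With |A| much larger than t, essentially every bin is filled; with |A| = O(t) (say 4 elements into 64 bins), most bins are empty, which is exactly the regime where the abstract notes that densification schemes lose their concentration guarantees.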