On the Complexity of Inner Product Similarity Join

Thomas Dybdahl Ahle, R. Pagh, Ilya P. Razenshteyn, Francesco Silvestri
{"title":"On the Complexity of Inner Product Similarity Join","authors":"Thomas Dybdahl Ahle, R. Pagh, Ilya P. Razenshteyn, Francesco Silvestri","doi":"10.1145/2902251.2902285","DOIUrl":null,"url":null,"abstract":"A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are: Approximation hardness of IPS join in subquadratic time, assuming the strong exponential time hypothesis. New upper and lower bounds for (A)LSH-based algorithms. In particular, we show that asymmetry can be avoided by relaxing the LSH definition to only consider the collision probability of distinct elements. A new indexing method for IPS based on linear sketches, implying that our hardness results are not far from being tight. Our technical contributions include new asymmetric embeddings that may be of independent interest. At the conceptual level we strive to provide greater clarity, for example by distinguishing among signed and unsigned variants of IPS join and shedding new light on the effect of asymmetry.","PeriodicalId":158471,"journal":{"name":"Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"45","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2902251.2902285","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 45

Abstract

A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are: Approximation hardness of IPS join in subquadratic time, assuming the strong exponential time hypothesis. New upper and lower bounds for (A)LSH-based algorithms. In particular, we show that asymmetry can be avoided by relaxing the LSH definition to only consider the collision probability of distinct elements. A new indexing method for IPS based on linear sketches, implying that our hardness results are not far from being tight. Our technical contributions include new asymmetric embeddings that may be of independent interest. At the conceptual level we strive to provide greater clarity, for example by distinguishing among signed and unsigned variants of IPS join and shedding new light on the effect of asymmetry.
论内积相似连接的复杂度
分类、信息检索、推荐系统和记录链接中的许多任务都归结为内积相似连接(IPS join)的核心问题:识别集合中具有足够大内积的向量对。当向量被归一化并且允许一些近似的内积时,IPS连接是很容易理解的。然而,一般情况下,向量可以有任何长度似乎更具挑战性。近年来,基于非对称位置敏感哈希(ALSH)和非对称嵌入的新的上界已经出现,但对下界知之甚少。本文对内积相似连接进行了系统的研究,给出了新的下界和上界。我们的主要研究结果是:在强指数时间假设下,IPS连接在次二次时间内的近似硬度。(A) lsh算法的新上界和下界。特别地,我们证明了通过放宽LSH定义,只考虑不同元素的碰撞概率,可以避免不对称。本文提出了一种新的基于线性草图的硬度标引方法,这表明我们的硬度结果并不差。我们的技术贡献包括新的非对称嵌入,这可能是独立的兴趣。在概念层面,我们努力提供更大的清晰度,例如,通过区分IPS连接的有符号和无符号变体,并阐明不对称的影响。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信