Bridging Dense and Sparse Maximum Inner Product Search

IF 5.4 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty
ACM Transactions on Information Systems. Published 2024-05-17. DOI: 10.1145/3665324 (https://doi.org/10.1145/3665324)
Citations: 0

Abstract

Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-\(k\) retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals, even though they are manifestations of the same mathematical problem. In this work, we ask whether algorithms for dense vectors can be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-\(k\) retrieval methods. We study clustering-based approximate MIPS, in which vectors are partitioned into clusters and only a fraction of the clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that clustering-based retrieval serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and explore their potential. First, we cast the clustering-based paradigm as dynamic pruning and turn that insight into a novel organization of the inverted index for approximate MIPS over general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, one that is robust to query distributions.
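
To make the clustering-based idea concrete, here is a minimal Python sketch under stated assumptions; it is not the authors' implementation. Sparse vectors are represented as plain dictionaries, partitioning uses a basic spherical KMeans loop, and retrieval scores only the `nprobe` clusters whose centroids have the largest inner product with the query. All names (`dot`, `spherical_kmeans`, `build_index`, `search`, `nprobe`) are hypothetical, and the dimensionality-reduction step mentioned in the abstract is skipped.

```python
import random
from collections import defaultdict

# Assumption: a sparse vector is a dict {dimension index: nonzero value}.


def dot(a, b):
    """Inner product of two sparse vectors; iterates over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(x * b.get(i, 0.0) for i, x in a.items())


def normalize(v):
    """Scale v to unit L2 norm (a zero vector is returned unchanged)."""
    norm = dot(v, v) ** 0.5
    return {i: x / norm for i, x in v.items()} if norm > 0 else dict(v)


def spherical_kmeans(data, k, iters=10, seed=0):
    """Return k unit-norm centroids from a basic spherical KMeans loop."""
    rng = random.Random(seed)
    unit = [normalize(v) for v in data]
    centroids = [dict(v) for v in rng.sample(unit, k)]
    for _ in range(iters):
        sums = [defaultdict(float) for _ in range(k)]
        for v in unit:
            # Assign to the centroid with the largest cosine similarity.
            best = max(range(k), key=lambda c: dot(v, centroids[c]))
            for i, x in v.items():
                sums[best][i] += x
        for c in range(k):
            if sums[c]:  # keep the previous centroid if a cluster is empty
                centroids[c] = normalize(sums[c])
    return centroids


def build_index(data, k, iters=10, seed=0):
    """Partition document ids into k clusters; returns (centroids, clusters)."""
    centroids = spherical_kmeans(data, k, iters, seed)
    clusters = [[] for _ in range(k)]
    for doc_id, v in enumerate(data):
        best = max(range(k), key=lambda c: dot(v, centroids[c]))
        clusters[best].append(doc_id)
    return centroids, clusters


def search(query, data, centroids, clusters, top_k, nprobe):
    """Approximate MIPS: score only documents in the nprobe best clusters."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [d for c in ranked[:nprobe] for d in clusters[c]]
    candidates.sort(key=lambda d: dot(query, data[d]), reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    rng = random.Random(42)
    docs = [{rng.randrange(100): rng.random() for _ in range(8)}
            for _ in range(500)]
    q = {rng.randrange(100): rng.random() for _ in range(8)}
    cents, parts = build_index(docs, k=16)
    print(search(q, docs, cents, parts, top_k=5, nprobe=4))
```

Setting `nprobe` equal to the number of clusters makes the search exhaustive and therefore exact; smaller values trade recall for fewer inner-product evaluations, which is the trade-off the abstract refers to when it says only a fraction of clusters are searched.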

Source journal
ACM Transactions on Information Systems (Engineering and Technology: Computer Science, Information Systems)
CiteScore: 9.40
Self-citation rate: 14.30%
Articles published: 165
Review time: >12 weeks
About the journal: The ACM Transactions on Information Systems (TOIS) publishes papers on information retrieval (such as search engines, recommender systems) that contain: new principled information retrieval models or algorithms with sound empirical validation; observational, experimental and/or theoretical studies yielding new insights into information retrieval or information seeking; accounts of applications of existing information retrieval techniques that shed light on the strengths and weaknesses of the techniques; formalization of new information retrieval or information seeking tasks and of methods for evaluating the performance on those tasks; development of content (text, image, speech, video, etc.) analysis methods to support information retrieval and information seeking; development of computational models of user information preferences and interaction behaviors; creation and analysis of evaluation methodologies for information retrieval and information seeking; or surveys of existing work that propose a significant synthesis. The information retrieval scope of ACM Transactions on Information Systems (TOIS) appeals to industry practitioners for its wealth of creative ideas, and to academic researchers for its descriptions of their colleagues' work.