PL2AP: fast parallel cosine similarity search

D. Anastasiu, G. Karypis
Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms. Published 2015-11-15. DOI: 10.1145/2833179.2833182 (https://doi.org/10.1145/2833179.2833182)
Citations: 10

Abstract

Solving the AllPairs similarity search problem entails finding all pairs of vectors in a high-dimensional sparse dataset that have a similarity value higher than a given threshold. The output of this problem is a crucial component in many real-world applications, such as clustering, online advertising, recommender systems, near-duplicate document detection, and query refinement. A number of serial algorithms have been proposed that solve the problem by pruning many of the possible similarity candidates for each query object after accessing only a few of their non-zero values. The pruning process results in unpredictable memory access patterns that can reduce search efficiency. In this context, we introduce pL2AP, which efficiently solves the AllPairs cosine similarity search problem in a multi-core environment. Our method uses a number of cache-tiling optimizations, combined with fine-grained, dynamically balanced parallel tasks, to solve the problem 1.5x to 238x faster than existing parallel baselines on datasets with hundreds of millions of non-zeros.
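To make the problem statement concrete, the following is a minimal serial sketch of AllPairs cosine similarity search over sparse vectors. It is not pL2AP: it applies no candidate pruning, cache tiling, or parallel task scheduling, and only illustrates the output that such methods compute. The function names (`normalize`, `all_pairs`), the dict-based sparse representation, and the inverted-index accumulation strategy are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature -> value) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {f: v / norm for f, v in vec.items()} if norm > 0 else {}


def all_pairs(vectors, threshold):
    """Return (i, j, sim) triples with i > j and cosine similarity > threshold.

    Hypothetical baseline: accumulates dot products through an inverted
    index and filters at the end, with no pruning of candidates.
    """
    unit = [normalize(v) for v in vectors]
    index = defaultdict(list)  # feature -> list of (vector id, value)
    results = []
    for i, vec in enumerate(unit):
        acc = defaultdict(float)  # candidate id -> partial dot product
        for f, v in vec.items():
            for j, w in index[f]:
                acc[j] += v * w
        # Unit-length vectors: the dot product equals the cosine similarity.
        for j, sim in acc.items():
            if sim > threshold:
                results.append((i, j, sim))
        # Index the current vector only after querying, so each pair is
        # evaluated once.
        for f, v in vec.items():
            index[f].append((i, v))
    return results
```

Calling `all_pairs([{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"c": 5}], 0.9)` reports only the pair of the first two vectors, since they point in the same direction and the third shares no features with either. Every candidate that shares at least one feature is fully scored here; the pruning strategies mentioned in the abstract exist to avoid exactly that cost.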