ZigZag: Supporting Similarity Queries on Vector Space Models

Proceedings of the 2018 International Conference on Management of Data
Pub Date: 2018-05-27
DOI: 10.1145/3183713.3196936

Wenhai Li, Lingfeng Deng, Yang Li, Chen Li

Citations: 3

Abstract

In this paper we study the problem of supporting similarity queries on a large number of records using a vector space model, where each record is a bag of tokens. We consider similarity functions that incorporate non-negative global token weights as well as record-specific token degrees. We develop a family of algorithms based on an inverted index for large data sets, especially for the case of using external storage such as hard disks or flash drives, and present pruning techniques based on various bounds to improve their performance. We formally prove the correctness of these techniques, and show how to achieve better pruning power by iteratively tightening these bounds to exactly filter dissimilar records. We conduct an extensive experimental study using real, large-scale data sets based on different storage platforms, including memory, hard disks, and flash drives. The results show that these algorithms and techniques can efficiently support similarity queries on large data sets.
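To make the setting concrete, the following is a minimal sketch of a threshold similarity query over an inverted index with one simple bound-based pruning rule. It is not the ZigZag algorithm from the paper, whose details the abstract does not give. It assumes a weighted dot-product similarity (global weight times query degree times record degree, summed over shared tokens), and the names `build_index` and `similarity_query` are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Build an inverted index from {record_id: {token: degree}}.

    Returns (postings, max_degree): postings maps a token to a list of
    (record_id, degree); max_degree maps a token to its largest degree.
    """
    postings = defaultdict(list)
    max_degree = defaultdict(float)
    for rid, bag in records.items():
        for token, degree in bag.items():
            postings[token].append((rid, degree))
            max_degree[token] = max(max_degree[token], degree)
    return postings, max_degree


def similarity_query(query, weights, postings, max_degree, threshold):
    """Return {record_id: score} for records with score >= threshold.

    Assumed similarity (not taken from the paper):
        score(q, r) = sum over shared tokens t of
                      weights[t] * query[t] * degree_r[t]
    All weights and degrees are assumed non-negative.
    """
    # Upper bound on what each query token can add to any record's score.
    bounds = []
    for token, q_degree in query.items():
        if token in postings:
            w = weights.get(token, 0.0)
            bounds.append((w * q_degree * max_degree[token], token, q_degree))
    bounds.sort(reverse=True)  # largest potential contribution first

    # suffix[i] = total bound of lists i, i+1, ...; a record first seen in
    # list i cannot score more than suffix[i].
    suffix = [0.0] * (len(bounds) + 1)
    for i in range(len(bounds) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + bounds[i][0]

    scores = {}
    for i, (_, token, q_degree) in enumerate(bounds):
        # Pruning: once the remaining bound falls below the threshold,
        # stop admitting new candidates and only update existing ones.
        # Floating-point rounding exactly at the boundary is ignored here.
        admit = suffix[i] >= threshold
        w = weights.get(token, 0.0)
        for rid, degree in postings[token]:
            if rid in scores:
                scores[rid] += w * q_degree * degree
            elif admit:
                scores[rid] = w * q_degree * degree

    return {rid: s for rid, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    records = {
        "r1": {"a": 1.0, "b": 1.0},
        "r2": {"b": 1.0, "c": 1.0},
        "r3": {"c": 0.5},
    }
    weights = {"a": 2.0, "b": 1.0, "c": 0.5}
    postings, max_degree = build_index(records)
    query = {"a": 1.0, "b": 1.0, "c": 1.0}
    print(similarity_query(query, weights, postings, max_degree, 1.0))
```

In this sketch the bound is computed once per query. The abstract describes tightening bounds iteratively, which would prune more aggressively than this single static check.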
