Curse of Dimensionality in Pivot Based Indexes

2009 Second International Workshop on Similarity Search and Applications. Pub Date: 2009-06-02. DOI: 10.1109/SISAP.2009.9

I. Volnyansky, V. Pestov

Citations: 28

Abstract

We offer a theoretical validation of the curse of dimensionality in pivot-based indexing of datasets for similarity search by proving, in the framework of statistical learning, that in high dimensions no pivot-based indexing scheme can essentially outperform a linear scan. We study the asymptotic performance of pivot-based indexing schemes on a sequence of datasets modeled as i.i.d. samples drawn from a sequence of metric spaces. The size of the dataset is allowed to grow with the dimension, in such a way that the dimension is superlogarithmic but subpolynomial in the size of the dataset, and the number of pivots is sublinear in the size of the dataset. We adopt the least restrictive cost model of similarity search, in which each distance calculation counts as a single computation and all other work is disregarded. We demonstrate that if the intrinsic dimension of the spaces, in the sense of the concentration of measure phenomenon, is linear in the dimension, then the performance of pivot-based indexes for similarity search is asymptotically linear in the size of the dataset.
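The cost model in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the construction analyzed in the paper: it assumes Euclidean vectors and a range query, and the names `euclidean`, `build_pivot_table`, and `range_query` are hypothetical. It precomputes distances from every data point to each pivot, uses the triangle inequality to discard points at query time, and reports the number of distance computations, which is the only cost the abstract counts.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_pivot_table(data, pivots, dist=euclidean):
    """Precompute d(x, p) for every data point x and pivot p.

    This happens at index-build time, so it is not charged to queries.
    """
    return [[dist(x, p) for p in pivots] for x in data]


def range_query(query, radius, data, pivots, table, dist=euclidean):
    """Return (indices within radius of query, distance computations used).

    Assumption for this sketch: only calls to `dist` made at query time
    are counted, matching the cost model described in the abstract.
    """
    q_to_pivots = [dist(query, p) for p in pivots]
    cost = len(pivots)  # one distance computation per pivot
    hits = []
    for i, x in enumerate(data):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for every pivot p,
        # so the largest such gap is a lower bound on d(q, x).
        lower = max(
            (abs(qp - xp) for qp, xp in zip(q_to_pivots, table[i])),
            default=0.0,
        )
        if lower > radius:
            continue  # discarded without computing d(q, x)
        cost += 1  # the point survived filtering, so d(q, x) must be computed
        if dist(query, x) <= radius:
            hits.append(i)
    return hits, cost


# Usage: ten points on a line, one pivot at the origin.
data = [(float(i),) for i in range(10)]
pivots = [(0.0,)]
table = build_pivot_table(data, pivots)
hits, cost = range_query((4.0,), 1.0, data, pivots, table)
print(hits, cost)
```

In this one-dimensional example the pivot filters out most points, so the query costs far fewer than the ten distance computations of a linear scan. The abstract's result concerns the opposite regime: when distances concentrate in high dimensions, the lower bound rarely exceeds the query radius, few points are discarded, and the count approaches the size of the dataset.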

