Combinatorial Framework for Similarity Search
Y. Lifshits
2009 Second International Workshop on Similarity Search and Applications
Published: 2009-08-29
DOI: 10.1109/SISAP.2009.31 (https://doi.org/10.1109/SISAP.2009.31)
Citations: 5
Abstract
We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if it is allowed only direct comparisons between two pairwise similarity values. That is, the input dataset is represented by a comparison oracle that, given any three points X, Y, Z, answers whether Y or Z is closer to X. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if X is the A-th most similar object to Y and Y is the B-th most similar object to Z, then X is among the D(A+B) most similar objects to Z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not rely on the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. Ranwalk, the first known combinatorial solution for nearest neighbors, is a randomized, exact, zero-error algorithm with query time logarithmic in the number of objects, but its preprocessing time is quadratic. A later solution, called combinatorial nets, is a deterministic and exact algorithm with almost linear preprocessing time and space and near-logarithmic search time. Combinatorial nets also have a number of side applications. For near-duplicate detection, they lead to the first known deterministic algorithm that requires only near-linear time plus time proportional to the size of the output. For any dataset with small disorder, combinatorial nets can be used to construct a visibility graph: a graph in which greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known workaround for Navarro's impossibility result on generalizing Delaunay graphs.
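The comparison-oracle model and the disorder inequality can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not Ranwalk or combinatorial nets: all function names are hypothetical, ranks are taken as 1-based with the reference object excluded, and only the single variation of the disorder inequality quoted in the abstract is checked. The brute-force disorder computation is cubic and the nearest-neighbor routine is a plain linear scan, so both serve only as a baseline for tiny datasets.

```python
from functools import cmp_to_key
from itertools import permutations


def make_oracle(dist):
    """Wrap a hidden dissimilarity function as a three-point comparison oracle.

    closer(x, y, z) is True when y is strictly closer to x than z is.
    The routines below call only `closer`, never `dist` itself.
    """
    def closer(x, y, z):
        return dist(x, y) < dist(x, z)
    return closer


def similarity_order(x, others, closer):
    """Sort `others` from most to least similar to x using only the oracle."""
    def cmp(y, z):
        if closer(x, y, z):
            return -1
        if closer(x, z, y):
            return 1
        return 0  # tie: the stable sort keeps the input order
    return sorted(others, key=cmp_to_key(cmp))


def rank_table(points, closer):
    """rank[y][x] == A means x is the A-th most similar object to y.

    Assumption: ranks are 1-based and y itself is excluded from its own order.
    Points must be distinct and hashable.
    """
    table = {}
    for y in points:
        order = similarity_order(y, [p for p in points if p != y], closer)
        table[y] = {x: i + 1 for i, x in enumerate(order)}
    return table


def disorder_constant(points, closer):
    """Smallest D such that rank_z(x) <= D * (rank_y(x) + rank_z(y))
    for all distinct x, y, z (the one variation stated in the abstract)."""
    rank = rank_table(points, closer)
    return max(
        rank[z][x] / (rank[y][x] + rank[z][y])
        for x, y, z in permutations(points, 3)
    )


def nearest_by_scan(query, points, closer):
    """Baseline exact nearest neighbor using len(points) - 1 oracle calls."""
    best = points[0]
    for p in points[1:]:
        if closer(query, p, best):
            best = p
    return best


if __name__ == "__main__":
    points = [0, 1, 3, 7, 15]  # toy 1-D dataset with distinct distances
    closer = make_oracle(lambda a, b: abs(a - b))
    print(similarity_order(3, [0, 1, 7, 15], closer))  # [1, 0, 7, 15]
    print(disorder_constant(points, closer))
    print(nearest_by_scan(6, points, closer))  # 7
```

Because every routine sees the data only through `closer`, swapping in a different similarity function requires no other change, which is the second advantage the abstract names.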