Approximate high-dimensional nearest neighbor queries using R-forests

Proceedings. International Database Engineering and Applications Symposium. Pub Date: 2013-10-09. DOI: 10.1145/2513591.2513652

Michael Nolen, King-Ip Lin

Pages: 48-57

Citations: 2

Abstract

Efficient query processing on high-dimensional data is important but remains a challenge, because the curse of dimensionality makes efficient solutions very difficult. On the other hand, it has been suggested that it is often preferable to quickly return a solution that is close enough to be sufficient. In this paper we introduce the R-Forest, a set of disjoint R-trees built over the domain of the search space. Each R-tree stores a subset of the points in a non-overlapping region of the space, and this partitioning is maintained throughout the life of the forest. The structure also includes several new features: a median point used for ordering and searching, a pruning parameter, and restricted access. Combined, these can be used to answer approximate nearest neighbor (ANN) queries, returning results that improve on alternative methods, such as the Locality Sensitive Hashing B-Tree (LSB-tree), with the same amount of I/O. Our approach handles different data distributions, even taking advantage of the distribution without any additional parameter tuning. It scales with increasing dimensionality and, most importantly, provides the user with feedback on the quality of the results in the form of a lower bound.
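The abstract does not spell out how the forest is built or searched, so the following is only a rough illustrative sketch of the general idea and not the authors' algorithm. Everything specific in it is an assumption: plain point lists stand in for the R-trees, the space is cut into disjoint slabs along the first coordinate (a hypothetical partitioning rule), partitions are visited in order of the query's distance to each partition's median point, and `max_parts` is a hypothetical stand-in for restricted access. The lower bound is the standard distance from the query to the bounding box of each partition that was not visited.

```python
import math
import random
from statistics import median


def build_partitions(points, n_parts):
    """Split points into disjoint slabs along dimension 0 (assumed rule)."""
    ordered = sorted(points, key=lambda p: p[0])
    size = math.ceil(len(ordered) / n_parts)
    parts = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        dims = range(len(chunk[0]))
        parts.append({
            "points": chunk,  # stand-in for one R-tree of the forest
            "lo": tuple(min(p[d] for p in chunk) for d in dims),
            "hi": tuple(max(p[d] for p in chunk) for d in dims),
            "median": tuple(median(p[d] for p in chunk) for d in dims),
        })
    return parts


def box_distance(q, lo, hi):
    """Smallest possible distance from q to any point inside the box [lo, hi]."""
    nearest = tuple(min(max(x, l), h) for x, l, h in zip(q, lo, hi))
    return math.dist(q, nearest)


def approx_nn(parts, q, max_parts):
    """Scan only the max_parts partitions whose median point is closest to q.

    Returns (best point, its distance, lower bound on the true NN distance).
    """
    order = sorted(parts, key=lambda part: math.dist(q, part["median"]))
    best_point, best_dist = None, math.inf
    for part in order[:max_parts]:
        for p in part["points"]:
            d = math.dist(q, p)
            if d < best_dist:
                best_point, best_dist = p, d
    # No point in an unvisited partition can be closer than that partition's box.
    unseen = [box_distance(q, part["lo"], part["hi"]) for part in order[max_parts:]]
    lower_bound = min([best_dist] + unseen)
    return best_point, best_dist, lower_bound


random.seed(0)
pts = [tuple(random.random() for _ in range(8)) for _ in range(400)]
forest = build_partitions(pts, n_parts=8)
query = tuple(random.random() for _ in range(8))
point, dist, bound = approx_nn(forest, query, max_parts=2)
print(f"approximate NN distance {dist:.4f}, true NN distance is at least {bound:.4f}")
```

In this sketch the true nearest-neighbor distance always lies between `bound` and `dist`, so the gap between the two values tells the caller how far from optimal the returned answer could be. Raising `max_parts` spends more I/O to narrow that gap.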
