Similarity-based Retrieval in High Dimensional Data with Recursive Lists of Clusters: A Study Case with Natural Language Dictionaries

2009 International Conference on Information Management and Engineering. Pub Date: 2009-04-03. DOI: 10.1109/ICIME.2009.12

F. Barbosa


Citations: 6

Abstract

An important issue in similarity-based retrieval over high-dimensional data objects is data representation. To use an indexing structure that can handle large databases effectively, it is essential to reduce the dimensionality of the data objects. Symbolic representation of the objects is a promising dimension-reduction technique, and it lets researchers draw on algorithms and techniques from text retrieval. Similarity search consists of finding the objects in a collection that are similar to a given object. Comparing the given object to every other object in a large database is prohibitively slow. If the objects can be placed in a metric space, the search can be sped up by comparing the query object to a restricted number of objects rather than to the entire database. If the objects are strings (text) and a "good" metric for comparing them exists, we get a metric space. Metric data structures are used to make similarity search in metric spaces efficient. We evaluate the performance of range queries in the Recursive Lists of Clusters (RLC) metric data structure when the metric spaces are natural language dictionaries with the Extended Edit Distance (EED). The study compares RLC with the Vp-tree data structure on six different dictionaries, which are characterized by the mean and the variance of their histograms of distances. The experimental results show that RLC performs well in all tested cases and, in some of them, outperforms the Vp-tree. In addition, RLC is the only data structure that keeps its good performance whether the space dimension is lower or higher.
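To illustrate how a metric lets a range query skip most distance computations, here is a minimal Python sketch. It is not the RLC or the Vp-tree from the paper: it uses a single pivot, and the plain Levenshtein distance stands in for the Extended Edit Distance, whose definition is not given in this abstract. The names `levenshtein` and `PivotIndex` and the sample word list are hypothetical, chosen only for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert, delete, substitute, each of cost 1).

    Assumption: this stands in for the paper's Extended Edit Distance.
    """
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


class PivotIndex:
    """Hypothetical single-pivot index; far simpler than RLC or a Vp-tree."""

    def __init__(self, words: Iterable[str], pivot: str,
                 dist: Callable[[str, str], int] = levenshtein) -> None:
        self.pivot = pivot
        self.dist = dist
        # Precompute d(pivot, w) once, at build time.
        self.entries: List[Tuple[str, int]] = [(w, dist(pivot, w)) for w in words]

    def range_query(self, query: str, radius: int) -> Tuple[List[str], int]:
        """Return (words within radius of query, distance computations used)."""
        d_qp = self.dist(query, self.pivot)
        computed = 1
        results = []
        for word, d_wp in self.entries:
            # Triangle inequality: d(q, w) >= |d(q, p) - d(w, p)|.
            # If that lower bound exceeds the radius, w cannot match.
            if abs(d_qp - d_wp) > radius:
                continue
            computed += 1
            if self.dist(query, word) <= radius:
                results.append(word)
        return results, computed


if __name__ == "__main__":
    words = ["casa", "cama", "cara", "carro", "gato", "elefante", "computador"]
    index = PivotIndex(words, pivot="casa")
    matches, computed = index.range_query("capa", radius=1)
    print(matches, "using", computed, "distance computations")
```

The pruning is only valid because the distance satisfies the triangle inequality, which is what makes the dictionary a metric space. Structures such as RLC and the Vp-tree apply the same bound hierarchically over many reference objects instead of one.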


