Authors: F. Barbosa
Published in: 2009 International Conference on Information Management and Engineering, 2009-04-03
DOI: https://doi.org/10.1109/ICIME.2009.12
Similarity-based Retrieval in High Dimensional Data with Recursive Lists of Clusters: A Study Case with Natural Language Dictionaries
An important issue in similarity-based retrieval of high-dimensional data objects is data representation. To use an indexing structure that can handle large databases effectively, it is essential to reduce the dimensionality of the data objects. Symbolic representation of the objects is a promising dimension-reduction technique, because it allows researchers to draw on algorithms and techniques from text retrieval. Similarity search consists of finding the objects in a collection that are similar to a given object. Comparing the given object to every other object in a large database is prohibitively slow. If the objects can be placed in a metric space, the search can be sped up by comparing the query object to a restricted number of objects rather than to the entire database. If the objects are strings (text) and a "good" metric for comparing them exists, we obtain a metric space. Metric data structures are used to make similarity search in metric spaces efficient. We evaluate the performance of range queries in the Recursive Lists of Clusters (RLC) metric data structure when the metric spaces are natural language dictionaries under the Extended Edit Distance (EED). The study compares RLC with the Vp-tree data structure on six different dictionaries, which are characterized by the mean and the variance of their histograms of distances. The experimental results show that RLC performs well in all tested cases and, in some of them, outperforms the Vp-tree. In addition, RLC is the only data structure that consistently maintains its good performance, whether the dimension of the space is lower or higher.
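To illustrate how a metric lets a range query skip most of the database, here is a minimal Python sketch. It is not the RLC or the Vp-tree structure evaluated in the paper: it uses a generic single-pivot filter based on the triangle inequality, and the plain Levenshtein edit distance stands in for the Extended Edit Distance, whose definition is not given in the abstract. The names `levenshtein`, `PivotIndex`, and `range_query`, and the sample word list, are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit-cost insert, delete, substitute.

    Assumption: used here as a stand-in metric, not the paper's EED.
    """
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


class PivotIndex:
    """Hypothetical one-pivot index over a list of words."""

    def __init__(self, words: List[str], pivot: str) -> None:
        self.pivot = pivot
        # Distances to the pivot are computed once, at build time.
        self.entries: List[Tuple[str, int]] = [
            (w, levenshtein(w, pivot)) for w in words
        ]
        self.distance_computations = 0  # counted per query

    def range_query(self, query: str, radius: int) -> List[str]:
        """Return all indexed words within `radius` of `query`."""
        d_qp = levenshtein(query, self.pivot)
        self.distance_computations = 1
        results = []
        for word, d_wp in self.entries:
            # Triangle inequality: d(q, w) >= |d(q, p) - d(w, p)|,
            # so this word cannot be in range and is skipped unseen.
            if abs(d_qp - d_wp) > radius:
                continue
            self.distance_computations += 1
            if levenshtein(query, word) <= radius:
                results.append(word)
        return results


words = ["cat", "cart", "card", "dog", "house", "mouse", "horse"]
index = PivotIndex(words, pivot="cat")
print(index.range_query("care", radius=1))  # ['cart', 'card']
print(index.distance_computations)          # 4, versus 7 for a full scan
```

The filter never discards a true answer, because it only relies on the triangle inequality; structures such as RLC and the Vp-tree apply the same kind of pruning recursively over clusters or tree nodes rather than over a single pivot.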