Metric Index: An Efficient and Scalable Solution for Similarity Search

2009 Second International Workshop on Similarity Search and Applications | Pub Date: 2009-08-29 | DOI: 10.1109/SISAP.2009.26

David Novak, Michal Batko

Citations: 52

Abstract

Metric space, as a universal and versatile model of similarity, can be applied in various areas of non-text information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practically all known principles of metric space partitioning, pruning, and filtering. The heart of the M-Index is a general mapping mechanism that makes it possible to store the data in well-established structures such as the B+-tree, or even in distributed storage. We have implemented the M-Index with a B+-tree and performed experiments on a combination of five MPEG-7 descriptors in a database of hundreds of thousands of digital images. The experiments test several M-Index variants and compare them with two orthogonal approaches: the PM-Tree and the iDistance. The trials show that the M-Index outperforms the others in terms of search-space pruning efficiency, I/O costs, and response times for precise similarity queries. Furthermore, the M-Index demonstrates an excellent ability to keep similar data close together in the index, which makes its approximation algorithm very efficient: response times stay practically constant while recall remains very high as the dataset grows.
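
The abstract does not give the M-Index key formula, so the sketch below is not the authors' method. It illustrates the general idea of mapping metric objects to scalar keys that an ordered structure can store, using the simpler iDistance-style mapping `key = i * c + d(p_i, o)`, where `p_i` is the pivot closest to `o`. Everything named here is an assumption made for illustration: the class `PivotKeyIndex`, the Euclidean distance, the constant `c` (which must be strictly greater than any distance that can occur), and the sorted Python list that stands in for a B+-tree.

```python
import bisect
import math


def euclidean(a, b):
    """Example metric; any function satisfying the metric axioms would do."""
    return math.dist(a, b)


class PivotKeyIndex:
    """Illustrative pivot-to-scalar-key index (iDistance-style, not M-Index)."""

    def __init__(self, pivots, max_distance, distance=euclidean):
        self.pivots = list(pivots)
        self.c = max_distance      # assumed strict upper bound on all distances
        self.distance = distance
        self.keys = []             # sorted scalar keys (stand-in for a B+-tree)
        self.objects = []          # objects stored in the same order as keys

    def key(self, obj):
        """Map obj to i * c + d(p_i, obj), where p_i is its closest pivot."""
        dists = [self.distance(p, obj) for p in self.pivots]
        i = min(range(len(dists)), key=dists.__getitem__)
        return i * self.c + dists[i]

    def insert(self, obj):
        k = self.key(obj)
        pos = bisect.bisect_left(self.keys, k)
        self.keys.insert(pos, k)
        self.objects.insert(pos, obj)

    def range_query(self, query, radius):
        """Return all stored objects within `radius` of `query`."""
        results = []
        for i, pivot in enumerate(self.pivots):
            dq = self.distance(pivot, query)
            # Triangle inequality: any match o in cluster i satisfies
            # dq - radius <= d(p_i, o) <= dq + radius.
            lo = i * self.c + max(dq - radius, 0.0)
            hi = i * self.c + dq + radius
            start = bisect.bisect_left(self.keys, lo)
            # Never scan past the end of cluster i's key interval.
            end = min(bisect.bisect_right(self.keys, hi),
                      bisect.bisect_left(self.keys, (i + 1) * self.c))
            for obj in self.objects[start:end]:
                # Candidates are verified with the real distance.
                if self.distance(query, obj) <= radius:
                    results.append(obj)
        return results


index = PivotKeyIndex(pivots=[(0.0, 0.0), (10.0, 10.0)], max_distance=100.0)
data = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0), (10.0, 11.0), (5.0, 4.0)]
for point in data:
    index.insert(point)

found = index.range_query((1.0, 2.0), 1.5)
```

Because each cluster occupies its own contiguous key interval, a range query turns into one interval scan per pivot followed by exact distance checks on the candidates. A production system would replace the sorted list with a B+-tree or another ordered store.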


