Similarity search in high-dimensional datasets

Computer Vision meets Databases. Pub Date: 2005-06-17. DOI: 10.1145/1160939.1160941

R. Ramakrishnan, J. Goldstein, U. Shaft


Citation count: 0

Abstract


Similarity search in high-dimensional datasets

Finding "similar" multimedia objects is a central problem. A popular approach is to represent objects as vectors in a high-dimensional space and to build a spatial index over a collection of such vectors in order to retrieve the "nearest neighbors" of a query object. This approach rests on two fundamental assumptions: first, that the user's notion of similarity can be captured by distance in the space in which the vectors are embedded; and second, that nearest neighbors can be retrieved efficiently. In this talk, we discuss these assumptions, based on our experience with the PiQ image database project, carried out at the University of Wisconsin-Madison, and some subsequent work.

We will first present a brief overview of the PiQ system and its goal of identifying the DBMS infrastructure required to support image databases, and discuss the role of similarity and nearest-neighbor queries in content-based querying. Next, we consider when the notion of "nearest neighbor" is well-defined in high-dimensional spaces, and when efficient indexing is feasible. The goal is not to suggest that indexing high-dimensional data is impossible, although our results here are mainly negative. Rather, we seek to identify the conditions under which effective indexing and retrieval techniques are feasible, and the key difficulties that must be overcome. Finally, we present some indexing techniques to retrieve nearest neighbors under appropriate conditions, highlighting the role played by redundancy and approximation.
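The question of when "nearest neighbor" is well-defined can be illustrated with a small experiment. The sketch below is not taken from the talk or from the PiQ system. It assumes i.i.d. uniform data in the unit hypercube and Euclidean distance, and the function name `relative_contrast` and its parameters are hypothetical. It computes the distance from one random query to every point by brute force, then reports how much farther the farthest point is than the nearest one, relative to the nearest distance. Under these assumptions the ratio tends to shrink as the dimensionality grows, which is one way the nearest neighbor can become hard to distinguish from the rest of the data.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Assumption for illustration only: the query and the n_points data
    points are drawn i.i.d. uniformly from the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Brute-force Euclidean distance (math.dist needs Python 3.8+).
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        contrast = relative_contrast(dim)
        print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

Data with strong cluster structure or low intrinsic dimensionality may behave quite differently from this uniform setup, which is in keeping with the abstract's aim of identifying the conditions under which indexing remains feasible.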
