Scalable similarity search with optimized kernel hashing

Junfeng He, W. Liu, Shih-Fu Chang
{"title":"Scalable similarity search with optimized kernel hashing","authors":"Junfeng He, W. Liu, Shih-Fu Chang","doi":"10.1145/1835804.1835946","DOIUrl":null,"url":null,"abstract":"Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"147","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1835804.1835946","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 147

Abstract

Scalable similarity search is at the core of many large-scale learning and data mining applications. Recently, many research results have demonstrated that one promising approach is to create compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among the generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm that creates efficient codes for large-scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets, and so on. Starting from an idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel-based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as the Nyström approximation, to further reduce the time and space complexity of indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is its ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities such as label consistency. We evaluate our method on both vector and non-vector data sets at scales of up to 1 million samples. Our comprehensive results show that the proposed method outperforms several state-of-the-art approaches on all tasks, with a significant gain on most.
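As a concrete illustration of the kind of hash function the abstract describes, the sketch below builds each bit from kernel evaluations against a small set of landmark samples, so that any data type with a kernel (vectors, graphs, sequences, sets) could in principle be hashed. This is a minimal sketch, not the paper's actual optimization: the RBF kernel, the random landmark subsampling (standing in for the Nyström approximation), and the SVD-based projections are illustrative assumptions, and names such as `KernelHasher` and `rbf_kernel` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel matrix between rows of X and rows of Z (illustrative choice)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

class KernelHasher:
    """Illustrative kernelized hashing: each bit is
    h_j(x) = sign( sum_m W[m, j] * k(x, z_m) - b_j ),
    where the z_m are landmark samples (a Nystrom-style subsample), so a new
    sample is hashed from kernel evaluations alone, never raw vectors."""

    def __init__(self, n_bits=16, n_landmarks=64, gamma=1.0, seed=0):
        self.n_bits, self.n_landmarks, self.gamma = n_bits, n_landmarks, gamma
        self.rng = np.random.default_rng(seed)

    def fit(self, X):
        # Pick landmark samples (Nystrom-style subsampling of the data).
        idx = self.rng.choice(len(X), self.n_landmarks, replace=False)
        self.landmarks = X[idx]
        K = rbf_kernel(X, self.landmarks, self.gamma)   # (n, m) kernel features
        Kc = K - K.mean(0)                              # center the features
        # Stand-in for the paper's optimization: top principal directions
        # of the kernel feature map give the bit projections.
        _, _, Vt = np.linalg.svd(Kc, full_matrices=False)
        self.W = Vt[: self.n_bits].T                    # (m, n_bits)
        self.b = K.mean(0) @ self.W                     # threshold each bit at its mean
        return self

    def hash(self, X):
        K = rbf_kernel(X, self.landmarks, self.gamma)
        return (K @ self.W - self.b > 0).astype(np.uint8)

# Usage: hash a toy dataset and a query, then rank by Hamming distance.
X = np.random.default_rng(1).normal(size=(1000, 32))
hasher = KernelHasher(n_bits=16).fit(X)
codes = hasher.hash(X)
query = hasher.hash(X[:1])
hamming = (codes != query).sum(1)   # smaller = more similar
```

Because hashing a new sample needs only its kernel values against the m landmarks, indexing costs O(m) kernel evaluations per sample rather than O(n) against the full training set, which is the scalability argument the abstract makes for incorporating the Nyström approximation.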