b-bit minwise hashing in practice

Proceedings of the 5th Asia-Pacific Symposium on Internetware. Pub Date: 2013-10-23. DOI: 10.1145/2532443.2532446

Ping Li, Anshumali Shrivastava, A. König


Citations: 12

Abstract

Minwise hashing is a standard technique in the context of search for approximating set similarities. Recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues that must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations to each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
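The preprocessing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes set elements are non-negative integer ids, replaces each random permutation with a 2-universal hash of the form (a*x + c) mod p, and keeps only the lowest b bits of each of the k minima. The names `make_2u_hashes`, `bbit_minhash`, `match_fraction`, and `estimate_resemblance` are hypothetical, and the estimator uses the sparse-data simplification in which two unrelated b-bit values collide with probability about 2^-b.

```python
import random

# Assumption: element ids are non-negative integers smaller than this Mersenne prime.
PRIME = (1 << 61) - 1


def make_2u_hashes(k, seed=0):
    """Draw k functions h(x) = (a*x + c) mod PRIME from a 2-universal family."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def bbit_minhash(items, hashes, b):
    """Return k values, each the lowest b bits of one minimum hash value.

    `items` must be a non-empty collection of integer ids.
    """
    mask = (1 << b) - 1
    signature = []
    for a, c in hashes:
        # Each minimum is independent of the other k - 1 minima.
        smallest = min((a * x + c) % PRIME for x in items)
        signature.append(smallest & mask)
    return signature


def match_fraction(sig1, sig2):
    """Fraction of positions where two signatures agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


def estimate_resemblance(sig1, sig2, b):
    """Approximate resemblance, assuming sparse data so chance matches occur at rate 2^-b."""
    chance = 2.0 ** -b
    return (match_fraction(sig1, sig2) - chance) / (1.0 - chance)


if __name__ == "__main__":
    rng = random.Random(42)
    universe = range(1_000_000)
    doc_a = set(rng.sample(universe, 2000))
    # doc_b shares half of doc_a's elements and adds 1000 fresh ones.
    doc_b = set(list(doc_a)[:1000]) | set(rng.sample(universe, 1000))
    hashes = make_2u_hashes(k=500, seed=7)
    sig_a = bbit_minhash(doc_a, hashes, b=8)
    sig_b = bbit_minhash(doc_b, hashes, b=8)
    print("estimated resemblance:", estimate_resemblance(sig_a, sig_b, b=8))
```

Storing k pairs (a, c) takes space proportional to k, whereas a fully random permutation matrix grows with the dimensionality of the data, which is the storage concern the abstract raises. Because the k minima are computed independently, the outer loop is also the natural unit to spread across parallel workers; the GPU scheme from the paper is not reproduced here.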

