ShockHash: Near Optimal-Space Minimal Perfect Hashing Beyond Brute-Force

Impact factor 0.7 · Zone 4 (Computer Science) · JCR Q4 (Computer Science, Software Engineering)
Hans-Peter Lehmann, Peter Sanders, Stefan Walzer
Algorithmica, vol. 87, no. 11, pp. 1620-1668. Published 2025-08-02.
DOI: 10.1007/s00453-025-01321-z
Article page: https://link.springer.com/article/10.1007/s00453-025-01321-z
PDF: https://link.springer.com/content/pdf/10.1007/s00453-025-01321-z.pdf

Abstract

A minimal perfect hash function (MPHF) maps a set S of n keys to the first n integers without collisions. There is a lower bound of \(n\log_2 e - \mathcal{O}(\log n) \approx 1.44n\) bits needed to represent an MPHF. This can be reached by a brute-force algorithm that tries \(e^n\) hash function seeds in expectation and stores the first seed that leads to an MPHF. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash: Small, heavily overloaded cuckoo hash tables for minimal perfect hashing. ShockHash uses two hash functions \(h_0\) and \(h_1\), hoping for the existence of a function \(f : S \rightarrow \{0,1\}\) such that \(x \mapsto h_{f(x)}(x)\) is an MPHF on S. It then uses a 1-bit retrieval data structure to store f using \(n + o(n)\) bits. In graph terminology, ShockHash generates n-edge random graphs until stumbling on a pseudoforest, where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. We show that ShockHash needs to try only about \((e/2)^n \approx 1.359^n\) seeds in expectation. This reduces the space for storing the seed by roughly n bits (maintaining the asymptotically optimal space consumption) and speeds up construction by almost a factor of \(2^n\) compared to brute force. Bipartite ShockHash reduces the expected construction time again to about \(1.166^n\) by maintaining a pool of candidate hash functions and checking all possible pairs. Using ShockHash as a building block within the RecSplit framework, we obtain ShockHash-RS, which can be constructed up to 3 orders of magnitude faster than competing approaches. ShockHash-RS can build an MPHF for 10 million keys with 1.489 bits per key in about half an hour.
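The space figures above can be checked with a rough calculation. It assumes, as a simplification not spelled out in the abstract, that a seed found after T trials in expectation costs about \(\log_2 T\) bits to store, and it ignores lower-order terms:

\[
\begin{aligned}
\text{brute-force seed:}\quad & \log_2 e^n = n\log_2 e \approx 1.443n \text{ bits},\\
\text{ShockHash seed:}\quad & \log_2 (e/2)^n = n\log_2 e - n \approx 0.443n \text{ bits},\\
\text{ShockHash total:}\quad & (n\log_2 e - n) + (n + o(n)) = n\log_2 e + o(n) \text{ bits}.
\end{aligned}
\]

Under that assumption, the n bits saved on the seed are spent again on the 1-bit retrieval structure for f, so the total stays near the \(1.44n\) lower bound while the number of seeds to try drops from \(e^n\) to about \((e/2)^n\).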
When instead using ShockHash after an efficient k-perfect hash function, it achieves space usage similar to the best competitors, while being significantly faster to construct and query.
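The two-hash-function idea from the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: `two_hashes`, `is_orientable`, `orient`, `build`, and `query` are hypothetical names, BLAKE2b stands in for \(h_0\) and \(h_1\), a plain dict stands in for the 1-bit retrieval data structure, and a simple peeling pass replaces the cuckoo-hashing step. Keys are assumed to be distinct strings.

```python
from __future__ import annotations

import hashlib


def two_hashes(key: str, seed: int, n: int) -> tuple[int, int]:
    """Hypothetical stand-in for h_0 and h_1: two slots in [0, n) from one seeded digest."""
    salt = seed.to_bytes(8, "little")
    d = hashlib.blake2b(key.encode(), digest_size=16, salt=salt).digest()
    return int.from_bytes(d[:8], "little") % n, int.from_bytes(d[8:], "little") % n


def is_orientable(edges: list[tuple[int, int]], n: int) -> bool:
    """Union-find check: no component may hold more edges than nodes."""
    parent = list(range(n))
    edge_count = [0] * n  # valid at component roots
    node_count = [1] * n  # valid at component roots

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
            edge_count[ra] += edge_count[rb]
            node_count[ra] += node_count[rb]
        edge_count[ra] += 1
        if edge_count[ra] > node_count[ra]:
            return False
    return True


def orient(edges: list[tuple[int, int]], n: int) -> list[int]:
    """Give every edge (key) one of its two endpoints (slots); returns one bit per edge."""
    incident = [set() for _ in range(n)]
    for i, (a, b) in enumerate(edges):
        incident[a].add(i)
        incident[b].add(i)  # a self-loop is stored once
    bits: list[int | None] = [None] * len(edges)
    filled = [False] * n

    def assign(i: int, v: int) -> int:
        a, b = edges[i]
        bits[i] = 0 if v == a else 1
        filled[v] = True
        incident[a].discard(i)
        incident[b].discard(i)
        return b if v == a else a  # the endpoint that did not receive edge i

    def peel(stack: list[int]) -> None:
        while stack:
            v = stack.pop()
            if filled[v] or len(incident[v]) != 1:
                continue
            u = assign(next(iter(incident[v])), v)
            if not filled[u] and len(incident[u]) == 1:
                stack.append(u)

    peel([v for v in range(n) if len(incident[v]) == 1])  # strip tree parts
    for i in range(len(edges)):
        if bits[i] is None:  # edge lies on a cycle: break the cycle here
            peel([assign(i, edges[i][0])])
    return bits


def build(keys: list[str], max_seeds: int = 100_000) -> tuple[int, dict[str, int]]:
    """Try seeds until the n-edge graph on n nodes is a pseudoforest, then derive f."""
    n = len(keys)
    for seed in range(max_seeds):
        edges = [two_hashes(k, seed, n) for k in keys]
        if is_orientable(edges, n):
            return seed, dict(zip(keys, orient(edges, n)))
    raise RuntimeError("no suitable seed found within max_seeds")


def query(key: str, seed: int, n: int, f: dict[str, int]) -> int:
    """Evaluate x -> h_{f(x)}(x)."""
    return two_hashes(key, seed, n)[f[key]]


if __name__ == "__main__":
    demo_keys = [f"key-{i}" for i in range(12)]
    demo_seed, demo_f = build(demo_keys)
    print(demo_seed, sorted(query(k, demo_seed, len(demo_keys), demo_f) for k in demo_keys))
```

The sketch only makes sense for small n, matching the "small" in the name: the number of seeds tried grows exponentially, so larger key sets would first be split into small subsets, as the abstract describes for ShockHash-RS and the k-perfect variant.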

Source journal: Algorithmica
Category: Engineering and Technology, Computer Science: Software Engineering
CiteScore: 2.80
Self-citation rate: 9.10%
Articles published: 158
Review time: 12 months
Journal description: Algorithmica is an international journal which publishes theoretical papers on algorithms that address problems arising in practical areas, and experimental papers of general appeal for practical importance or techniques. The development of algorithms is an integral part of computer science. The increasing complexity and scope of computer applications make the design of efficient algorithms essential. Algorithmica covers algorithms in applied areas such as: VLSI, distributed computing, parallel processing, automated design, robotics, graphics, data base design, software tools, as well as algorithms in fundamental areas such as sorting, searching, data structures, computational geometry, and linear programming. In addition, the journal features two special sections: Application Experience, presenting findings obtained from applications of theoretical results to practical situations, and Problems, offering short papers presenting problems on selected topics of computer science.