{"title":"ShockHash: Near Optimal-Space Minimal Perfect Hashing Beyond Brute-Force","authors":"Hans-Peter Lehmann, Peter Sanders, Stefan Walzer","doi":"10.1007/s00453-025-01321-z","DOIUrl":null,"url":null,"abstract":"<div><p>A minimal perfect hash function (MPHF) maps a set <i>S</i> of <i>n</i> keys to the first <i>n</i> integers without collisions. There is a lower bound of <span>\\(n\\log _2e-\\mathcal {O}(\\log n) \\approx 1.44n\\)</span> bits needed to represent an MPHF. This can be reached by a <i>brute-force</i> algorithm that tries <span>\\(e^n\\)</span> hash function seeds in expectation and stores the first seed that leads to an MPHF. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash – <b>S</b>mall, <b>h</b>eavily <b>o</b>verloaded cu<b>ck</b>oo <b>hash</b> tables for minimal perfect hashing. ShockHash uses two hash functions <span>\\(h_0\\)</span> and <span>\\(h_1\\)</span>, hoping for the existence of a function <span>\\(f : S \\rightarrow \\{0,1\\}\\)</span> such that <span>\\(x \\mapsto h_{f(x)}(x)\\)</span> is an MPHF on <i>S</i>. It then uses a 1-bit retrieval data structure to store <i>f</i> using <span>\\(n + o(n)\\)</span> bits. In graph terminology, ShockHash generates <i>n</i>-edge random graphs until stumbling on a <i>pseudoforest</i> – where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. We show that ShockHash needs to try only about <span>\\((e/2)^n \\approx 1.359^n\\)</span> seeds in expectation. This reduces the space for storing the seed by roughly <i>n</i> bits (maintaining the asymptotically optimal space consumption) and speeds up construction by almost a factor of <span>\\(2^n\\)</span> compared to brute-force. <i>Bipartite</i> ShockHash reduces the expected construction time again to about <span>\\(1.166^n\\)</span> by maintaining a pool of candidate hash functions and checking all possible pairs. Using ShockHash as a building block within the RecSplit framework we obtain ShockHash-RS, which can be constructed up to 3 orders of magnitude faster than competing approaches. ShockHash-RS can build an MPHF for 10 million keys with 1.489 bits per key in about half an hour. When instead using ShockHash after an efficient <i>k</i>-perfect hash function, it achieves space usage similar to the best competitors, while being significantly faster to construct and query.</p></div>","PeriodicalId":50824,"journal":{"name":"Algorithmica","volume":"87 11","pages":"1620 - 1668"},"PeriodicalIF":0.7000,"publicationDate":"2025-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s00453-025-01321-z.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Algorithmica","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s00453-025-01321-z","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
A minimal perfect hash function (MPHF) maps a set S of n keys to the first n integers without collisions. There is a lower bound of \(n\log_2 e - \mathcal{O}(\log n) \approx 1.44n\) bits needed to represent an MPHF. This can be reached by a brute-force algorithm that tries \(e^n\) hash function seeds in expectation and stores the first seed that leads to an MPHF. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash – Small, heavily overloaded cuckoo hash tables for minimal perfect hashing. ShockHash uses two hash functions \(h_0\) and \(h_1\), hoping for the existence of a function \(f : S \rightarrow \{0,1\}\) such that \(x \mapsto h_{f(x)}(x)\) is an MPHF on S. It then uses a 1-bit retrieval data structure to store f using \(n + o(n)\) bits. In graph terminology, ShockHash generates n-edge random graphs until stumbling on a pseudoforest – where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. We show that ShockHash needs to try only about \((e/2)^n \approx 1.359^n\) seeds in expectation. This reduces the space for storing the seed by roughly n bits (maintaining the asymptotically optimal space consumption) and speeds up construction by almost a factor of \(2^n\) compared to brute force. Bipartite ShockHash reduces the expected construction time further to about \(1.166^n\) by maintaining a pool of candidate hash functions and checking all possible pairs. Using ShockHash as a building block within the RecSplit framework, we obtain ShockHash-RS, which can be constructed up to 3 orders of magnitude faster than competing approaches. ShockHash-RS can build an MPHF for 10 million keys with 1.489 bits per key in about half an hour. When instead using ShockHash after an efficient k-perfect hash function, it achieves space usage similar to the best competitors, while being significantly faster to construct and query.
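To make the seed search described above concrete, here is a minimal Python sketch, not the authors' implementation: placeholder hash functions derived from Python's built-in hash stand in for \(h_0\) and \(h_1\), and a union-find pass rejects a seed as soon as some component of the induced graph would receive a second cycle. Because the graph has exactly n edges on n nodes, a seed that survives this test yields a pseudoforest in which every component contains as many edges as nodes, so cuckoo hashing can place each key into one of its two candidate slots.

```python
def is_pseudoforest(edges, n):
    """Check that an n-edge multigraph on n nodes is a pseudoforest,
    i.e. no component ever receives a second cycle. With exactly n edges
    and n nodes this forces every component to contain as many edges as
    nodes, the condition under which cuckoo hashing with two choices can
    place every key."""
    parent = list(range(n))      # union-find forest
    has_cycle = [False] * n      # does this component already contain a cycle?

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:                          # edge inside one component
            if has_cycle[ru]:
                return False                  # would create a second cycle
            has_cycle[ru] = True
        else:                                 # edge merges two components
            if has_cycle[ru] and has_cycle[rv]:
                return False                  # merged component would have two cycles
            parent[ru] = rv
            has_cycle[rv] = has_cycle[ru] or has_cycle[rv]
    return True


def find_seed(keys):
    """Try seeds until the two candidate positions of every key form a
    pseudoforest. The hash functions below are illustrative placeholders,
    not the ones used in the paper."""
    n = len(keys)
    seed = 0
    while True:
        edges = [(hash((x, seed, 0)) % n, hash((x, seed, 1)) % n) for x in keys]
        if is_pseudoforest(edges, n):
            return seed   # cuckoo hashing then yields the MPHF; the choice
                          # bits f(x) go into a 1-bit retrieval structure
        seed += 1
```

Such a search only makes sense for small key sets, e.g. the buckets produced within the RecSplit framework or by a k-perfect hash function, since about \((e/2)^n\) seeds are needed in expectation. The successful seed is what gets stored: encoding it takes roughly \(n\log_2(e/2) \approx 0.44n\) bits, which together with the \(n + o(n)\) bits of the retrieval structure for f stays close to the \(1.44n\)-bit lower bound.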
About the journal:
Algorithmica is an international journal which publishes theoretical papers on algorithms that address problems arising in practical areas, and experimental papers of general appeal for practical importance or techniques. The development of algorithms is an integral part of computer science. The increasing complexity and scope of computer applications makes the design of efficient algorithms essential.
Algorithmica covers algorithms in applied areas such as VLSI, distributed computing, parallel processing, automated design, robotics, graphics, database design, and software tools, as well as algorithms in fundamental areas such as sorting, searching, data structures, computational geometry, and linear programming.
In addition, the journal features two special sections: Application Experience, presenting findings obtained from applications of theoretical results to practical situations, and Problems, offering short papers presenting problems on selected topics of computer science.