An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU

IF 1.8 Zone 3 Computer Science Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization Pub Date : 2024-04-16 DOI:10.1145/3659209

Ziheng Wang, Xiaoshe Dong, Yan Kang, Heng Chen, Qiang Wang


Citations: 0

Abstract

Hash-based signatures (HBSs) are the most conservative, and the most time-consuming, of the many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are currently the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST). Existing HBSs are built on serial Merkle tree traversal, which cannot take full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT) and evaluate it by implementing LMS on the GPU. This is the first work to accelerate LMS on the GPU, and it performs well even with over 10,000 cores. We implement PMTT variants for two scenarios, algorithmic parallelism and data parallelism: the variant for algorithmic parallelism mainly targets the execution efficiency of a single task, while the variant for data parallelism aims at full utilisation of GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms, which reduces the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48× faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, the parallel efficiency is 78.39% when the number of cores increases from 1 to 8192. Overall, our LMS implementation outperforms most existing LMS and XMSS implementations.
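To make the idea of parallelism in a Merkle tree concrete, the following is a minimal Python sketch, not the PMTT algorithm from the paper. It assumes a plain SHA-256 binary tree without the domain separation, identifiers, or one-time signature layer that real LMS uses, and all function names (`build_levels`, `auth_path`, `root_from_path`) are hypothetical. It only illustrates that every node on one level depends solely on the level below, so the hashes within a level can be computed independently, and that a signer releases the sibling nodes along the path from a leaf to the root.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def h(data: bytes) -> bytes:
    """Plain SHA-256; real LMS adds domain-separation fields (assumption: omitted here)."""
    return hashlib.sha256(data).digest()


def hash_pair(pair):
    """Hash one (left, right) pair of child nodes into their parent."""
    left, right = pair
    return h(left + right)


def build_levels(leaves, pool):
    """Return every tree level, hashed leaves first and the root level last.

    All pairs on one level are independent of each other, so they are
    submitted to the pool together; levels are processed one after another.
    """
    n = len(leaves)
    if n == 0 or n & (n - 1):
        raise ValueError("number of leaves must be a power of two")
    levels = [list(pool.map(h, leaves))]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        pairs = [(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)]
        levels.append(list(pool.map(hash_pair, pairs)))
    return levels


def auth_path(levels, index):
    """Collect the sibling node on each level from leaf `index` up to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])  # sibling differs only in the lowest bit
        index >>= 1
    return path


def root_from_path(leaf, index, path):
    """Recompute the root from a leaf and its authentication path."""
    node = h(leaf)
    for sibling in path:
        node = h(sibling + node) if index & 1 else h(node + sibling)
        index >>= 1
    return node


if __name__ == "__main__":
    leaves = [bytes([i]) * 4 for i in range(8)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        levels = build_levels(leaves, pool)
    root = levels[-1][0]
    print("root:", root.hex())
    print("leaf 5 verifies:", root_from_path(leaves[5], 5, auth_path(levels, 5)) == root)
```

The thread pool here only expresses the independence of the per-level work; for inputs this small, Python threads will not give a real speedup, whereas on a GPU each pair could map to its own thread. As a separate piece of arithmetic, if parallel efficiency is taken in the usual sense of speedup divided by core count (an assumption about the definition used), 78.39% at 8192 cores corresponds to a speedup of roughly 0.7839 × 8192 ≈ 6422 over a single core.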


Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore

3.60

Self-citation rate

6.20%

Articles published

Review time

6-12 weeks

About the journal: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.