Fast and Clean: Auditable high-performance assembly via constraint solving

IACR Transactions on Cryptographic Hardware and Embedded Systems. Pub Date: 2023-12-04. DOI: 10.46586/tches.v2024.i1.87-132

Aminudeen Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein



Abstract

Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain, threatening their suitability for use in practice.

In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture.

We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
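To make the idea of instruction scheduling as a search over constrained orderings concrete, the following is a minimal sketch and not SLOTHY's implementation or API. It assumes a hypothetical five-instruction program, made-up latencies, and a single-issue in-order core that stalls until operands are ready. Exhaustive enumeration stands in for a real constraint solver, and register allocation and software pipelining are not modelled at all.

```python
from itertools import permutations

# Hypothetical toy program: name -> (latency in cycles, names it depends on).
# Names and latencies are invented for illustration; they model no real core.
PROGRAM = {
    "load_a": (3, []),
    "load_b": (3, []),
    "mul":    (2, ["load_a", "load_b"]),
    "load_c": (3, []),
    "add":    (1, ["mul", "load_c"]),
}

def cycles(order, program):
    """Cycle at which the last result is ready on a single-issue, in-order
    core that stalls until all operands of the next instruction are ready.
    Returns None if the order places a consumer before its producer."""
    ready = {}   # name -> cycle at which its result becomes available
    cycle = 0    # next free issue slot
    for name in order:
        latency, deps = program[name]
        if any(d not in ready for d in deps):
            return None  # dependency constraint violated
        issue = max([cycle] + [ready[d] for d in deps])
        ready[name] = issue + latency
        cycle = issue + 1
    return max(ready.values())

def best_schedule(program):
    """Try every ordering and keep the first one with the lowest cost."""
    best = None
    for order in permutations(program):
        cost = cycles(order, program)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, order)
    return best

naive_cost = cycles(list(PROGRAM), PROGRAM)          # order as written
optimal_cost, optimal_order = best_schedule(PROGRAM)
print("as written:", naive_cost, "cycles")
print("reordered: ", optimal_cost, "cycles", optimal_order)
```

In this toy model the reordering hoists the independent load above the multiply so that its latency overlaps with the stall. Enumeration only works here because the program has five instructions; the number of orderings grows factorially, which is one reason real kernels call for a constraint solver rather than brute force.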


