快速、干净：通过约束求解实现可审计的高性能装配

IACR Transactions on Cryptographic Hardware and Embedded Systems Pub Date : 2023-12-04 DOI:10.46586/tches.v2024.i1.87-132

Aminudeen Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein

{"title":"快速、干净：通过约束求解实现可审计的高性能装配","authors":"Aminudeen Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein","doi":"10.46586/tches.v2024.i1.87-132","DOIUrl":null,"url":null,"abstract":"Handwritten assembly is a widely used tool in the development of highperformance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice.In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture.We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.","PeriodicalId":321490,"journal":{"name":"IACR Transactions on Cryptographic Hardware and Embedded Systems","volume":"14 5","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fast and Clean: Auditable high-performance assembly via constraint solving\",\"authors\":\"Aminudeen Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein\",\"doi\":\"10.46586/tches.v2024.i1.87-132\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Handwritten assembly is a widely used tool in the development of highperformance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice.In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture.We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.\",\"PeriodicalId\":321490,\"journal\":{\"name\":\"IACR Transactions on Cryptographic Hardware and Embedded Systems\",\"volume\":\"14 5\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IACR Transactions on Cryptographic Hardware and Embedded Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.46586/tches.v2024.i1.87-132\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IACR Transactions on Cryptographic Hardware and Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.46586/tches.v2024.i1.87-132","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

手写汇编是高性能加密技术开发中广泛使用的工具:通过提供对指令选择、指令调度和寄存器分配的完全控制，可以实现最高性能。另一方面，开发手写组装不仅耗时，而且产生的工件也往往难以审查和维护-威胁到它们在实践中使用的适用性。在这项工作中，我们提出了SLOTHY(超级(懒惰)优化棘手的手写汇编)，这是一个自动超优化汇编的框架，涉及指令调度，寄存器分配和循环优化(软件流水线):使用SLOTHY，开发人员控制并关注算法和指令选择，提供汇编中可读的“基础”实现，而SLOTHY根据目标(微)体系结构模型自动找到最佳和可跟踪的指令调度和寄存器分配策略。我们通过实例化Cortex-M55、Cortex-M85、Cortex-A55和cortex - a72微架构模型，实现Armv8.1-M+Helium和AArch64+Neon架构，展示了SLOTHY的灵活性。我们使用生成的工具来优化三种工作负载:首先，对于Cortex-M55和Cortex-M85，在定点和浮点运算中使用基数-4复快速傅里叶变换(FFT)，这是数字信号处理的基础。其次，在Cortex-M55、Cortex-M85、Cortex-A55和Cortex-A72上，最近宣布的两个NIST后量子加密标准化项目的获奖者——CRYSTALS-Kyber和CRYSTALS-Dilithium的数论变换(NTT)实例。第三，对于Cortex-A55，标量乘法为椭圆曲线键交换X25519。在保持紧凑性和可读性的同时，sloty优化的代码在所有情况下都可以匹配或优于现有技术的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Fast and Clean: Auditable high-performance assembly via constraint solving

Handwritten assembly is a widely used tool in the development of highperformance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice.In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture.We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IACR Transactions on Cryptographic Hardware and Embedded Systems

自引率

0.00%

发文量