UpWB：使用矢量化蒙哥马利乘法的白盒密码学非耦合架构设计

IACR Transactions on Cryptographic Hardware and Embedded Systems Pub Date : 2024-03-12 DOI:10.46586/tches.v2024.i2.677-713

Xiangren Chen, Bohan Yang, Jianfeng Zhu, Jun Liu, Shuying Yin, Guang Yang, Min Zhu, Shaojun Wei, Leibo Liu

{"title":"UpWB：使用矢量化蒙哥马利乘法的白盒密码学非耦合架构设计","authors":"Xiangren Chen, Bohan Yang, Jianfeng Zhu, Jun Liu, Shuying Yin, Guang Yang, Min Zhu, Shaojun Wei, Leibo Liu","doi":"10.46586/tches.v2024.i2.677-713","DOIUrl":null,"url":null,"abstract":"White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.","PeriodicalId":321490,"journal":{"name":"IACR Transactions on Cryptographic Hardware and Embedded Systems","volume":"3 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication\",\"authors\":\"Xiangren Chen, Bohan Yang, Jianfeng Zhu, Jun Liu, Shuying Yin, Guang Yang, Min Zhu, Shaojun Wei, Leibo Liu\",\"doi\":\"10.46586/tches.v2024.i2.677-713\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.\",\"PeriodicalId\":321490,\"journal\":{\"name\":\"IACR Transactions on Cryptographic Hardware and Embedded Systems\",\"volume\":\"3 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IACR Transactions on Cryptographic Hardware and Embedded Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.46586/tches.v2024.i2.677-713\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IACR Transactions on Cryptographic Hardware and Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.46586/tches.v2024.i2.677-713","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

白盒加密法（WBC）旨在保护秘密密钥，即使攻击者完全控制了执行环境。空间硬度方法是隐藏密钥的技术之一，它将密钥隐藏在由可靠的小型块密码生成的大型查找表中。尽管其安全性是可以证明的，但在通用硬件平台上执行时，空间硬度 WBC 的性能开销也很大，比传统的块密码慢几百倍。具体来说，最近的研究采用嵌套置换置换网络（NSPN）来构建专用的白盒区块密码[BIT16]，其性能受到大量回合、嵌套循环依赖性和高维动态最大距离可分离（MDS）矩阵的限制。我们提出了跨越时序安排、算法和算子的整体优化技术。对于高级定时计划，我们提出了细粒度任务分区（FTP）机制，以解耦不同行程次数的面向参数的嵌套循环。FTP 机制缩短了同步的空闲时间，避免了 FIFO 的额外使用，从而有效提高了计算吞吐量。为了优化算术运算器，我们设计了一种基于复杂度降低的蒙哥马利算法的灵活矢量化模块乘法器（VMM），它可以处理多精度变量数据、多大小矩阵-矢量乘法和不同的不可还原多项式。然后，介绍了一种具有对角线主数据流的可配置矩阵-矢量乘法（MVM）架构，以处理动态 MDS 矩阵。为了验证所提出的方法，我们展示了针对 SPNbox-8/16/24/32、Yoroi-16/32 和 WARX-16 等三个最新 WBC 系列的统一设计实现。在 FPGA 平台上进行评估后发现，UpWB 的计算吞吐量比优化软件（在 3.2 GHz 英特尔 CPU 上使用 AES-NI 和 AVX2 指令执行）高出 7 到 30 倍。在台积电 28nm 技术下合成的 UpWB 在最高频率 1.3 GHz 下运行时，计算吞吐量提高了 36 倍到 164 倍，所占面积仅为 0.14 平方毫米。此外，与最先进的技术相比，所提出的 VMM 还在不降低灵活性的情况下提高了约 30% 的面积效率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication

White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IACR Transactions on Cryptographic Hardware and Embedded Systems

自引率

0.00%

发文量