TurboHE: Accelerating Fully Homomorphic Encryption Using FPGA Clusters

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00084

Haohao Liao, Mahmoud A. Elmohr, Xuan Dong, Yanjun Qian, Wenzhe Yang, Zhiwei Shang, Yin Tan


Citations: 0

Abstract



With the burgeoning demand for cloud computing across many fields, and the rising attention to sensitive data exposure that has followed, Fully Homomorphic Encryption (FHE) is gaining popularity as a potential solution for privacy protection. By performing computations directly on the ciphertext (encrypted data) without decrypting it, FHE can guarantee the security of data throughout its lifecycle without compromising privacy. However, FHE schemes are so slow that adopting them is impractical in real-life applications, which is the problem hardware accelerators aim to mitigate. Among the various hardware platforms, FPGA clusters are particularly promising because of their flexibility and their ready availability from many cloud providers through offerings such as FPGA-as-a-Service (FaaS). Reusing this existing infrastructure can greatly facilitate the implementation of FHE in the cloud.

In this paper, we present TurboHE, the first hardware accelerator for FHE operations based on an FPGA cluster. TurboHE aims to boost the performance of CKKS, one of the fastest FHE schemes and the one most suitable for machine learning applications, by accelerating its computationally intensive and frequently used operation: relinearization. The proposed scalable architecture, based on hardware partitioning, can be easily configured to meet high acceleration requirements for relinearization with very large CKKS parameters. As a demonstration, an implementation supporting 32,768 polynomial coefficients and 594-bit coefficients decomposed into 11 Residue Number System (RNS) components was deployed on a cluster of 9 Xilinx VU13P FPGAs. The cluster operated at 200 MHz and achieved 1096 times the throughput of a single-threaded CPU implementation. Moreover, the low-level hardware components implemented in this work, such as the NTT module, can also be applied to accelerate other lattice-based cryptography schemes.
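To make the parameter set in the abstract more concrete, the following is a minimal Python sketch of an RNS decomposition with the quoted sizes (32,768 coefficients, 11 components). It is not the paper's design: the abstract does not list the actual moduli, so the sketch assumes 11 primes of 54 bits each (594 / 11) and picks them with q ≡ 1 (mod 2N), the standard condition for a negacyclic NTT of length N to exist modulo q. All function names are hypothetical.

```python
from math import prod

N = 32768          # polynomial coefficients, from the abstract
NUM_MODULI = 11    # RNS components, from the abstract
PRIME_BITS = 54    # assumption: 594 bits / 11 components


def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin; these bases suffice for n far above 2**64."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in bases:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def ntt_friendly_primes(count: int, bits: int, n: int) -> list[int]:
    """Scan downward from 2**bits for primes q with q % (2*n) == 1."""
    step = 2 * n
    q = (2**bits - 1) // step * step + 1  # largest candidate below 2**bits
    primes = []
    while len(primes) < count:
        if is_prime(q):
            primes.append(q)
        q -= step
    return primes


def to_rns(x: int, moduli: list[int]) -> list[int]:
    """Split one wide coefficient into small residues, one per modulus."""
    return [x % q for q in moduli]


def from_rns(residues: list[int], moduli: list[int]) -> int:
    """Recombine residues with the Chinese Remainder Theorem."""
    big_q = prod(moduli)
    total = 0
    for r, q in zip(residues, moduli):
        q_hat = big_q // q
        total += r * q_hat * pow(q_hat, -1, q)  # modular inverse (Python 3.8+)
    return total % big_q


moduli = ntt_friendly_primes(NUM_MODULI, PRIME_BITS, N)
Q = prod(moduli)

# One wide multiplication becomes 11 independent word-sized multiplications.
a, b = 3**300 % Q, 5**200 % Q
c_rns = [x * y % q for x, y, q in zip(to_rns(a, moduli), to_rns(b, moduli), moduli)]
assert from_rns(c_rns, moduli) == a * b % Q
```

Because each residue channel is independent, a decomposition like this lends itself to spreading work across parallel hardware units, which is one plausible reading of the "hardware partitioning" mentioned in the abstract; how TurboHE actually maps components to FPGAs is not stated there.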
