Bandwidth Efficient Homomorphic Encrypted Matrix Vector Multiplication Accelerator on FPGA
Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna
2022 International Conference on Field-Programmable Technology (ICFPT), published 2022-12-05
DOI: 10.1109/ICFPT56656.2022.9974369
Citations: 1
Abstract
Homomorphic Encryption (HE) is a promising solution to growing privacy concerns in Machine Learning (ML), as it enables computation directly on encrypted data. However, it imposes significant overhead on the compute system and remains impractically slow. Prior work has proposed efficient FPGA implementations of basic HE primitives such as the number theoretic transform (NTT) and key switching. Composing these primitives to realize higher-level ML computation remains a challenge due to the large data transfer overhead. In this work, we propose an efficient FPGA implementation of HE Matrix Vector Multiplication $(\mathbf{M}\times \mathbf{V})$, a key kernel in HE-based Machine Learning applications. By analyzing the data reuse characteristics and the encryption overhead of HE $\mathbf{M}\times \mathbf{V}$, we show that naively applying the design principles of unencrypted $\mathbf{M}\times \mathbf{V}$ accelerators to HE $\mathbf{M}\times \mathbf{V}$ can lead to a significant volume of DRAM data transfers. We tackle the computation and data transfer challenges with a bandwidth-efficient dataflow specially optimized for HE $\mathbf{M}\times \mathbf{V}$: we identify highly reused data entities in HE $\mathbf{M}\times \mathbf{V}$ and exploit the on-chip SRAM to reduce DRAM traffic. To speed up the computation of HE $\mathbf{M}\times \mathbf{V}$, we exploit three types of parallelism: partial sum parallelism, residual polynomial parallelism, and coefficient parallelism. Leveraging these innovations, we demonstrate the first FPGA accelerator for HE matrix vector multiplication. Evaluation on 7 HE $\mathbf{M}\times \mathbf{V}$ benchmarks shows that our FPGA accelerator is up to $3.8\times$ faster (GeoMean $2.8\times$) than a 64-thread CPU implementation.
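The abstract's three parallelism dimensions follow from the residue number system (RNS) representation commonly used in HE, where a ciphertext polynomial is stored as independent residue channels modulo a set of primes. The toy sketch below (with illustrative names, moduli, and sizes — not the paper's accelerator design or parameters) shows why the residue channels, the coefficients within a channel, and the per-row partial sums never interact, so each can be computed in parallel:

```python
import numpy as np

# Toy RNS-style illustration: a "ciphertext" coefficient vector is stored
# as one residue vector per small prime modulus. Real HE schemes use
# ~60-bit primes and large polynomial degrees; these values are toys.
PRIMES = [97, 193]   # illustrative RNS moduli
N = 8                # illustrative coefficient count

def to_rns(x):
    # Split an integer vector into one residue channel per modulus.
    return [x % q for q in PRIMES]

def rns_mv(mat, vec_rns):
    # M x V over each residue channel independently. Three independent
    # parallelism dimensions are visible here: the channels never mix
    # (residual polynomial parallelism), coefficients within a channel
    # are computed elementwise (coefficient parallelism), and each row's
    # dot product accumulates its own sums (partial sum parallelism).
    return [(mat @ v) % q for q, v in zip(PRIMES, vec_rns)]

mat = np.arange(N * N).reshape(N, N) % 7
vec = np.arange(N)
out_rns = rns_mv(mat, to_rns(vec))

# Cross-check: each channel matches the plain result reduced per modulus.
ref = mat @ vec
assert all(np.array_equal(o, ref % q) for q, o in zip(PRIMES, out_rns))
```

Because the channels are independent, an accelerator can assign them to separate compute lanes without any cross-lane communication, which is what makes this decomposition attractive on FPGA.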