Bandwidth Efficient Homomorphic Encrypted Matrix Vector Multiplication Accelerator on FPGA
Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna
2022 International Conference on Field-Programmable Technology (ICFPT), published 2022-12-05
DOI: 10.1109/ICFPT56656.2022.9974369
Citations: 1
Abstract
Homomorphic Encryption (HE) is a promising solution to growing privacy concerns in Machine Learning (ML), as it enables computation directly on encrypted data. However, it imposes significant overhead on the compute system and remains impractically slow. Prior work has proposed efficient FPGA implementations of basic HE primitives such as the number theoretic transform (NTT) and key switching. Composing these primitives to realize higher-level ML computation remains a challenge due to the large data transfer overhead. In this work, we propose an efficient FPGA implementation of HE Matrix Vector Multiplication $(\mathbf{M}\times \mathbf{V})$, a key kernel in HE-based Machine Learning applications. By analyzing the data reuse characteristics and the encryption overhead of HE $\mathbf{M}\times \mathbf{V}$, we show that naively applying the design principles of unencrypted $\mathbf{M}\times \mathbf{V}$ accelerators to HE $\mathbf{M}\times \mathbf{V}$ can lead to a significant volume of DRAM data transfers. We tackle the computation and data transfer challenges with a bandwidth-efficient dataflow specially optimized for HE $\mathbf{M}\times \mathbf{V}$: we identify highly reused data entities in HE $\mathbf{M}\times \mathbf{V}$ and exploit the on-chip SRAM to reduce DRAM traffic. To speed up the computation of HE $\mathbf{M}\times \mathbf{V}$, we exploit three types of parallelism: partial sum parallelism, residual polynomial parallelism, and coefficient parallelism. Leveraging these innovations, we demonstrate the first FPGA accelerator for HE matrix vector multiplication. Evaluation on 7 HE $\mathbf{M}\times \mathbf{V}$ benchmarks shows that our FPGA accelerator is up to $3.8\times$ faster (GeoMean $2.8\times$) than a 64-thread CPU implementation.
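The abstract's three parallelism dimensions follow from the residue number system (RNS) representation commonly used in HE, where a ciphertext polynomial is stored as independent residue channels modulo a set of primes. The toy sketch below (with illustrative names, moduli, and sizes — not the paper's accelerator design or parameters) shows why the residue channels, the coefficients within a channel, and the per-row partial sums never interact, so each can be computed in parallel:

```python
import numpy as np

# Toy RNS-style illustration: a "ciphertext" coefficient vector is stored
# as one residue vector per small prime modulus. Real HE schemes use
# ~60-bit primes and large polynomial degrees; these values are toys.
PRIMES = [97, 193]   # illustrative RNS moduli
N = 8                # illustrative coefficient count

def to_rns(x):
    # Split an integer vector into one residue channel per modulus.
    return [x % q for q in PRIMES]

def rns_mv(mat, vec_rns):
    # M x V over each residue channel independently. Three independent
    # parallelism dimensions are visible here: the channels never mix
    # (residual polynomial parallelism), coefficients within a channel
    # are computed elementwise (coefficient parallelism), and each row's
    # dot product accumulates its own sums (partial sum parallelism).
    return [(mat @ v) % q for q, v in zip(PRIMES, vec_rns)]

mat = np.arange(N * N).reshape(N, N) % 7
vec = np.arange(N)
out_rns = rns_mv(mat, to_rns(vec))

# Cross-check: each channel matches the plain result reduced per modulus.
ref = mat @ vec
assert all(np.array_equal(o, ref % q) for q, o in zip(PRIMES, out_rns))
```

Because the channels are independent, an accelerator can assign them to separate compute lanes without any cross-lane communication, which is what makes this decomposition attractive on FPGA.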