ENIGMA: Low-Latency and Privacy-Preserving Edge Inference on Heterogeneous Neural Network Accelerators

Qiushi Li, Ju Ren, Xinglin Pan, Yuezhi Zhou, Yaoxue Zhang

2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), July 2022. DOI: 10.1109/ICDCS54860.2022.00051
Time-efficient artificial intelligence (AI) services have recently attracted increasing interest from academia and industry, driven by the urgent needs of large-scale smart applications such as self-driving cars, virtual reality, and high-resolution video streaming. Existing solutions for reducing AI latency, such as edge computing and heterogeneous neural-network accelerators (NNAs), face a high risk of privacy leakage. To achieve both low latency and privacy preservation on edge servers equipped with NNAs, this paper proposes ENIGMA, which exploits the trusted execution environment (TEE) and heterogeneous NNAs of edge servers for edge inference. Low latency is enabled by a new ahead-of-time analysis framework that analyzes the linearity of multilayer neural networks, automatically slicing the forward graph and assigning each sub-graph to the TEE or an NNA. To avoid privacy leakage, we then introduce a pre-forwarded cipher generation (PFCG) scheme for computing linear sub-forward-graphs on the NNA: the input is encrypted into ciphertext that the linear sub-graphs can process directly, and the result is decrypted to recover the correct output. To enable non-linear sub-graph computation on the TEE, we use a ring cache and automatic vectorization to address the TEE's memory limitations. Qualitative analysis and quantitative experiments on GPU, NPU, and TPU demonstrate that ENIGMA is not only compatible with heterogeneous NNAs but also avoids leaking private features, with latency as low as 50 milliseconds.
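The abstract does not spell out PFCG's cipher construction, but the stated property (ciphertext that linear sub-graphs can compute on directly, with a decryptable output) is exactly what additive masking over a linear map provides: for any linear f, f(x + r) = f(x) + f(r). The sketch below is illustrative only, not the paper's implementation; the op names, `slice_by_linearity`, and `MaskedLinearSubgraph` are hypothetical. It partitions a forward graph into linear and non-linear runs (the linear runs would go to the NNA, the non-linear runs would stay in the TEE), then evaluates one linear sub-graph on untrusted hardware over a masked input.

```python
# Illustrative sketch only: PFCG's actual construction and ENIGMA's graph
# slicer are not specified in the abstract. This assumes additive masking,
# the textbook way to run a linear map on untrusted hardware.
import numpy as np

LINEAR_OPS = {"matmul", "conv", "avgpool"}        # hypothetical op taxonomy
NONLINEAR_OPS = {"relu", "softmax", "maxpool"}

def slice_by_linearity(forward_graph):
    """Partition an op sequence into maximal linear / non-linear runs.

    `forward_graph` is a list of (op_name, params) pairs; contiguous linear
    ops are grouped so each linear run can be offloaded as one sub-graph,
    while non-linear runs are kept together for the TEE.
    """
    slices, current, current_kind = [], [], None
    for op in forward_graph:
        kind = "linear" if op[0] in LINEAR_OPS else "nonlinear"
        if current and kind != current_kind:
            slices.append((current_kind, current))
            current = []
        current.append(op)
        current_kind = kind
    if current:
        slices.append((current_kind, current))
    return slices

class MaskedLinearSubgraph:
    """Additive-masking evaluation of a linear layer y = W @ x.

    Offline (inside the TEE): draw a random mask r and "pre-forward" it to
    obtain W @ r. Online: send x + r to the untrusted accelerator, receive
    W @ (x + r), and subtract the precomputed W @ r to recover W @ x exactly.
    """
    def __init__(self, W, rng):
        self.W = W
        self.r = rng.standard_normal(W.shape[1])   # secret mask, TEE-resident
        self.Wr = W @ self.r                       # pre-forwarded mask image

    def encrypt(self, x):
        return x + self.r          # ciphertext leaves the TEE

    def untrusted_forward(self, ct):
        return self.W @ ct         # runs on the NNA; never sees x or r alone

    def decrypt(self, ct_out):
        return ct_out - self.Wr    # exact by linearity of W

# Usage: slicing, then masked inference on one linear sub-graph.
demo = [("conv", {}), ("matmul", {}), ("relu", {}), ("matmul", {})]
print(slice_by_linearity(demo))    # linear run, relu run, linear run

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)
layer = MaskedLinearSubgraph(W, rng)
y = layer.decrypt(layer.untrusted_forward(layer.encrypt(x)))
assert np.allclose(y, W @ x)       # masked result matches plaintext forward
```

A real deployment along these lines would draw a fresh mask per inference and handle convolutions identically, since convolution is also linear in its input; the non-linear runs (ReLU, softmax, max-pooling) break this identity, which is presumably why the paper keeps them inside the TEE.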