{"title":"Enabling Rack-scale Confidential Computing using Heterogeneous Trusted Execution Environment","authors":"Jianping Zhu, Rui Hou, Xiaofeng Wang, Wenhao Wang, Jiangfeng Cao, Boyan Zhao, Zhongpu Wang, Yuhui Zhang, Jiameng Ying, Lixin Zhang, Dan Meng","doi":"10.1109/SP40000.2020.00054","DOIUrl":null,"url":null,"abstract":"With its huge real-world demands, large-scale confidential computing still cannot be supported by today’s Trusted Execution Environment (TEE), due to the lack of scalable and effective protection of high-throughput accelerators like GPUs, FPGAs, and TPUs etc. Although attempts have been made recently to extend the CPU-like enclave to GPUs, these solutions require change to the CPU or GPU chips, may introduce new security risks due to the side-channel leaks in CPU-GPU communication and are still under the resource constraint of today’s CPU TEE.To address these problems, we present the first Heterogeneous TEE design that can truly support large-scale compute or data intensive (CDI) computing, without any chip-level change. Our approach, called HETEE, is a device for centralized management of all computing units (e.g., GPUs and other accelerators) of a server rack. It is uniquely designed to work with today’s data centres and clouds, leveraging modern resource pooling technologies to dynamically compartmentalize computing tasks, and enforce strong isolation and reduce TCB through hardware support. More specifically, HETEE utilizes the PCIe ExpressFabric to allocate its accelerators to the server node on the same rack for a non-sensitive CDI task, and move them back into a secure enclave in response to the demand for confidential computing. Our design runs a thin TCB stack for security management on a security controller (SC), while leaving a large set of software (e.g., AI runtime, GPU driver, etc.) to the integrated microservers that operate enclaves. An enclaves is physically isolated from others through hardware and verified by the SC at its inception. Its microserver and computing units are restored to a secure state upon termination.We implemented HETEE on a real hardware system, and evaluated it with popular neural network inference and training tasks. Our evaluations show that HETEE can easily support the CDI tasks on the real-world scale and incurred a maximal throughput overhead of 2.17% for inference and 0.95% for training on ResNet152.","PeriodicalId":6849,"journal":{"name":"2020 IEEE Symposium on Security and Privacy (SP)","volume":"2 1","pages":"1450-1465"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Symposium on Security and Privacy (SP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SP40000.2020.00054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 50
Abstract
Despite its huge real-world demand, large-scale confidential computing still cannot be supported by today’s Trusted Execution Environments (TEEs), due to the lack of scalable and effective protection for high-throughput accelerators such as GPUs, FPGAs, and TPUs. Although attempts have been made recently to extend CPU-style enclaves to GPUs, these solutions require changes to the CPU or GPU chips, may introduce new security risks due to side-channel leaks in CPU-GPU communication, and remain subject to the resource constraints of today’s CPU TEEs.

To address these problems, we present the first Heterogeneous TEE design that can truly support large-scale compute- or data-intensive (CDI) computing, without any chip-level change. Our approach, called HETEE, is a device for centralized management of all computing units (e.g., GPUs and other accelerators) of a server rack. It is uniquely designed to work with today’s data centres and clouds, leveraging modern resource-pooling technologies to dynamically compartmentalize computing tasks, enforce strong isolation, and reduce the TCB through hardware support. More specifically, HETEE utilizes the PCIe ExpressFabric to allocate its accelerators to a server node on the same rack for a non-sensitive CDI task, and to move them back into a secure enclave in response to the demand for confidential computing. Our design runs a thin TCB stack for security management on a security controller (SC), while leaving a large set of software (e.g., AI runtime, GPU driver, etc.) to the integrated microservers that operate enclaves. An enclave is physically isolated from others through hardware and verified by the SC at its inception. Its microserver and computing units are restored to a secure state upon termination.

We implemented HETEE on a real hardware system and evaluated it with popular neural network inference and training tasks. Our evaluations show that HETEE can easily support CDI tasks at real-world scale, incurring a maximal throughput overhead of 2.17% for inference and 0.95% for training on ResNet152.
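The abstract describes the enclave lifecycle only at a high level: the SC pulls accelerators out of the shared rack pool by reconfiguring the PCIe fabric, verifies an enclave at its inception, and scrubs its state at termination. The Python sketch below is a rough, illustrative model of that lifecycle only; it is not the paper's implementation, and all names (SecurityController, Enclave, Accelerator, Mode) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class Mode(Enum):
    SHARED = auto()   # accelerator exposed to a rack server for non-sensitive CDI work
    ENCLAVE = auto()  # accelerator fenced off behind the security controller


@dataclass
class Accelerator:
    dev_id: str
    mode: Mode = Mode.SHARED


@dataclass
class Enclave:
    enclave_id: int
    microserver: str
    accelerators: List[Accelerator] = field(default_factory=list)
    attested: bool = False


class SecurityController:
    """Hypothetical SC logic mirroring the lifecycle sketched in the abstract."""

    def __init__(self) -> None:
        self._next_id = 0

    def create_enclave(self, microserver: str, accels: List[Accelerator]) -> Enclave:
        # Pull the accelerators out of the shared pool: conceptually, reconfigure
        # the PCIe fabric so they are reachable only from the enclave's microserver.
        for a in accels:
            a.mode = Mode.ENCLAVE
        enc = Enclave(self._next_id, microserver, accels)
        self._next_id += 1
        # Verify the enclave's software stack before admitting confidential work.
        enc.attested = self._verify(enc)
        return enc

    def terminate_enclave(self, enc: Enclave) -> None:
        # Restore device and microserver state, then return accelerators to the pool.
        for a in enc.accelerators:
            self._scrub(a)
            a.mode = Mode.SHARED
        enc.accelerators.clear()

    def _verify(self, enc: Enclave) -> bool:
        return True  # placeholder for measurement/attestation by the SC

    def _scrub(self, accel: Accelerator) -> None:
        pass  # placeholder for resetting the device to a clean state


if __name__ == "__main__":
    sc = SecurityController()
    pool = [Accelerator("gpu0"), Accelerator("gpu1")]
    enc = sc.create_enclave("microserver-3", pool)  # accelerators fenced into the enclave
    print(enc.attested, [a.mode for a in enc.accelerators])
    sc.terminate_enclave(enc)                        # scrubbed and returned to the pool
    print([a.mode for a in pool])
```

The sketch deliberately leaves verification and scrubbing as placeholders; in HETEE these steps, along with the physical isolation itself, are enforced in hardware by the SC and the PCIe ExpressFabric rather than in software.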