Efficient Architecture Paradigm for Deep Learning Inference as a Service

Jin Yu, Xiaopeng Ke, Fengyuan Xu, Hao Huang
{"title":"深度学习推理即服务的高效架构范式","authors":"Jin Yu, Xiaopeng Ke, Fengyuan Xu, Hao Huang","doi":"10.1109/IPCCC50635.2020.9391551","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) inference has been broadly used and shown excellent performance in many intelligent applications. Unfortunately, the high resource consumption and training efforts of sophisticated models present obstacles for regular users to enjoy it. Thus, Deep Learning Inference as a Service (DIaaS), offering online inference services on cloud, has earned great popularity among cloud tenants who can send their DIaaS inputs via RPCs across the internal network. However, such detached architecture paradigm is inappropriate to DIaaS because the high-dimensional inputs of DIaaS consume a lot of precious internal bandwidth and the service latency of DIaaS has to be low and stable. We therefore propose a novel architecture paradigm on cloud for DIaaS in order to address the above two problems without giving up the security and maintenance benefits. We first leverage the SGX technology, a strongly-protected user space enclave, to bring DIaaS computation to its input source as close as possible, i.e. co-locating a cloud tenant and its subscribed DIaaS in the same virtual machine. When the GPU acceleration is needed, we migrate this virtual machine to any available GPU host and transparently utilize the GPU via our backend computing stack installed on it. In this way the majority of internal bandwidth is saved compared to traditional paradigm. Furthermore, we greatly improve the efficiency of the proposed architecture paradigm, from the computation and I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves the average single CPU (GPU) based deep learning inference time 2.84X (4.87X) less than the traditional detached architecture on average.","PeriodicalId":226034,"journal":{"name":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Architecture Paradigm for Deep Learning Inference as a Service\",\"authors\":\"Jin Yu, Xiaopeng Ke, Fengyuan Xu, Hao Huang\",\"doi\":\"10.1109/IPCCC50635.2020.9391551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning (DL) inference has been broadly used and shown excellent performance in many intelligent applications. Unfortunately, the high resource consumption and training efforts of sophisticated models present obstacles for regular users to enjoy it. Thus, Deep Learning Inference as a Service (DIaaS), offering online inference services on cloud, has earned great popularity among cloud tenants who can send their DIaaS inputs via RPCs across the internal network. However, such detached architecture paradigm is inappropriate to DIaaS because the high-dimensional inputs of DIaaS consume a lot of precious internal bandwidth and the service latency of DIaaS has to be low and stable. We therefore propose a novel architecture paradigm on cloud for DIaaS in order to address the above two problems without giving up the security and maintenance benefits. 
We first leverage the SGX technology, a strongly-protected user space enclave, to bring DIaaS computation to its input source as close as possible, i.e. co-locating a cloud tenant and its subscribed DIaaS in the same virtual machine. When the GPU acceleration is needed, we migrate this virtual machine to any available GPU host and transparently utilize the GPU via our backend computing stack installed on it. In this way the majority of internal bandwidth is saved compared to traditional paradigm. Furthermore, we greatly improve the efficiency of the proposed architecture paradigm, from the computation and I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves the average single CPU (GPU) based deep learning inference time 2.84X (4.87X) less than the traditional detached architecture on average.\",\"PeriodicalId\":226034,\"journal\":{\"name\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPCCC50635.2020.9391551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPCCC50635.2020.9391551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Deep learning (DL) inference is widely used and performs excellently in many intelligent applications. Unfortunately, the high resource consumption and training effort of sophisticated models put such inference out of reach for regular users. Deep Learning Inference as a Service (DIaaS), which offers online inference services on the cloud, has therefore become popular among cloud tenants, who send their DIaaS inputs via RPCs across the internal network. However, this detached architecture paradigm is ill-suited to DIaaS: the high-dimensional inputs of DIaaS consume a great deal of precious internal bandwidth, and DIaaS service latency must be low and stable. We therefore propose a novel cloud architecture paradigm for DIaaS that addresses these two problems without giving up the security and maintenance benefits of the cloud. We first leverage SGX, a strongly protected user-space enclave technology, to bring DIaaS computation as close as possible to its input source, i.e., we co-locate a cloud tenant and its subscribed DIaaS in the same virtual machine. When GPU acceleration is needed, we migrate this virtual machine to any available GPU host and transparently utilize the GPU via our backend computing stack installed on that host. In this way, most of the internal bandwidth consumed by the traditional paradigm is saved. Furthermore, we greatly improve the efficiency of the proposed paradigm, from both the computation and I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves average single-CPU (GPU) deep learning inference times 2.84X (4.87X) lower than the traditional detached architecture.
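
To make the co-location and migration flow concrete, below is a minimal Python sketch of the placement decision the abstract describes. It is an illustrative assumption on our part, not the paper's implementation: the names (Host, TenantVM, place_inference) are hypothetical, and SGX enclave setup, VM live migration, and the backend computing stack are each abstracted into a single step.

```python
# Hypothetical sketch of the locality-aware placement policy described in the
# abstract. All names and structures here are illustrative assumptions, not
# the paper's actual API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    has_gpu: bool
    free_slots: int  # spare capacity for migrated VMs


@dataclass
class TenantVM:
    tenant: str
    host: Host  # the VM co-locates the tenant and its DIaaS enclave


def place_inference(vm: TenantVM, needs_gpu: bool,
                    gpu_hosts: List[Host]) -> str:
    """Decide where the co-located DIaaS enclave runs an inference.

    High-dimensional inputs never cross the internal network: either the
    enclave computes on the local CPU next to the input source, or the whole
    VM migrates to a GPU host and uses the backend stack installed there.
    """
    if not needs_gpu:
        # CPU path: run inside the SGX enclave on the tenant's own host.
        return f"run in enclave on {vm.host.name} (CPU)"

    if vm.host.has_gpu:
        # Already on a GPU host: use the local backend stack transparently.
        return f"run via backend stack on {vm.host.name} (GPU)"

    # Otherwise migrate the VM to any available GPU host, keeping inputs local.
    target: Optional[Host] = next(
        (h for h in gpu_hosts if h.free_slots > 0), None)
    if target is None:
        return f"fall back to CPU enclave on {vm.host.name}"
    target.free_slots -= 1
    vm.host = target
    return f"migrated VM to {target.name}; run via backend stack (GPU)"


if __name__ == "__main__":
    cpu_host = Host("cpu-host-1", has_gpu=False, free_slots=4)
    gpu_host = Host("gpu-host-7", has_gpu=True, free_slots=2)
    vm = TenantVM(tenant="tenant-a", host=cpu_host)
    print(place_inference(vm, needs_gpu=False, gpu_hosts=[gpu_host]))
    print(place_inference(vm, needs_gpu=True, gpu_hosts=[gpu_host]))
```

The point the sketch captures is the locality inversion relative to the detached paradigm: instead of shipping inputs to a remote inference service over RPC, computation moves to the data (the CPU enclave) or the whole VM moves to the accelerator (the GPU host).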