Efficient Architecture Paradigm for Deep Learning Inference as a Service

Jin Yu, Xiaopeng Ke, Fengyuan Xu, Hao Huang
{"title":"深度学习推理即服务的高效架构范式","authors":"Jin Yu, Xiaopeng Ke, Fengyuan Xu, Hao Huang","doi":"10.1109/IPCCC50635.2020.9391551","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) inference has been broadly used and shown excellent performance in many intelligent applications. Unfortunately, the high resource consumption and training efforts of sophisticated models present obstacles for regular users to enjoy it. Thus, Deep Learning Inference as a Service (DIaaS), offering online inference services on cloud, has earned great popularity among cloud tenants who can send their DIaaS inputs via RPCs across the internal network. However, such detached architecture paradigm is inappropriate to DIaaS because the high-dimensional inputs of DIaaS consume a lot of precious internal bandwidth and the service latency of DIaaS has to be low and stable. We therefore propose a novel architecture paradigm on cloud for DIaaS in order to address the above two problems without giving up the security and maintenance benefits. We first leverage the SGX technology, a strongly-protected user space enclave, to bring DIaaS computation to its input source as close as possible, i.e. co-locating a cloud tenant and its subscribed DIaaS in the same virtual machine. When the GPU acceleration is needed, we migrate this virtual machine to any available GPU host and transparently utilize the GPU via our backend computing stack installed on it. In this way the majority of internal bandwidth is saved compared to traditional paradigm. Furthermore, we greatly improve the efficiency of the proposed architecture paradigm, from the computation and I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves the average single CPU (GPU) based deep learning inference time 2.84X (4.87X) less than the traditional detached architecture on average.","PeriodicalId":226034,"journal":{"name":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Architecture Paradigm for Deep Learning Inference as a Service\",\"authors\":\"Jin Yu, Xiaopeng Ke, Fengyuan Xu, Hao Huang\",\"doi\":\"10.1109/IPCCC50635.2020.9391551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning (DL) inference has been broadly used and shown excellent performance in many intelligent applications. Unfortunately, the high resource consumption and training efforts of sophisticated models present obstacles for regular users to enjoy it. Thus, Deep Learning Inference as a Service (DIaaS), offering online inference services on cloud, has earned great popularity among cloud tenants who can send their DIaaS inputs via RPCs across the internal network. However, such detached architecture paradigm is inappropriate to DIaaS because the high-dimensional inputs of DIaaS consume a lot of precious internal bandwidth and the service latency of DIaaS has to be low and stable. We therefore propose a novel architecture paradigm on cloud for DIaaS in order to address the above two problems without giving up the security and maintenance benefits. 
We first leverage the SGX technology, a strongly-protected user space enclave, to bring DIaaS computation to its input source as close as possible, i.e. co-locating a cloud tenant and its subscribed DIaaS in the same virtual machine. When the GPU acceleration is needed, we migrate this virtual machine to any available GPU host and transparently utilize the GPU via our backend computing stack installed on it. In this way the majority of internal bandwidth is saved compared to traditional paradigm. Furthermore, we greatly improve the efficiency of the proposed architecture paradigm, from the computation and I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves the average single CPU (GPU) based deep learning inference time 2.84X (4.87X) less than the traditional detached architecture on average.\",\"PeriodicalId\":226034,\"journal\":{\"name\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPCCC50635.2020.9391551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPCCC50635.2020.9391551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Deep learning (DL) inference is widely used and performs excellently in many intelligent applications. Unfortunately, the high resource consumption and training effort of sophisticated models put such inference out of reach for regular users. Deep Learning Inference as a Service (DIaaS), which offers online inference services on the cloud, has therefore become popular among cloud tenants, who send their DIaaS inputs via RPCs across the internal network. However, this detached architecture paradigm is ill-suited to DIaaS: the high-dimensional inputs of DIaaS consume a great deal of precious internal bandwidth, and DIaaS service latency must be low and stable. We therefore propose a novel cloud architecture paradigm for DIaaS that addresses these two problems without giving up the security and maintenance benefits of the cloud. We first leverage SGX, a strongly protected user-space enclave technology, to bring DIaaS computation as close as possible to its input source, i.e., we co-locate a cloud tenant and its subscribed DIaaS in the same virtual machine. When GPU acceleration is needed, we migrate this virtual machine to any available GPU host and transparently utilize the GPU via our backend computing stack installed on that host. In this way, most of the internal bandwidth consumed by the traditional paradigm is saved. Furthermore, we greatly improve the efficiency of the proposed paradigm, from both the computation and I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves average single-CPU (GPU) deep learning inference times 2.84X (4.87X) lower than the traditional detached architecture.
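
To make the co-location and migration flow concrete, below is a minimal Python sketch of the placement decision the abstract describes. It is an illustrative assumption on our part, not the paper's implementation: the names (Host, TenantVM, place_inference) are hypothetical, and SGX enclave setup, VM live migration, and the backend computing stack are each abstracted into a single step.

```python
# Hypothetical sketch of the locality-aware placement policy described in the
# abstract. All names and structures here are illustrative assumptions, not
# the paper's actual API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    has_gpu: bool
    free_slots: int  # spare capacity for migrated VMs


@dataclass
class TenantVM:
    tenant: str
    host: Host  # the VM co-locates the tenant and its DIaaS enclave


def place_inference(vm: TenantVM, needs_gpu: bool,
                    gpu_hosts: List[Host]) -> str:
    """Decide where the co-located DIaaS enclave runs an inference.

    High-dimensional inputs never cross the internal network: either the
    enclave computes on the local CPU next to the input source, or the whole
    VM migrates to a GPU host and uses the backend stack installed there.
    """
    if not needs_gpu:
        # CPU path: run inside the SGX enclave on the tenant's own host.
        return f"run in enclave on {vm.host.name} (CPU)"

    if vm.host.has_gpu:
        # Already on a GPU host: use the local backend stack transparently.
        return f"run via backend stack on {vm.host.name} (GPU)"

    # Otherwise migrate the VM to any available GPU host, keeping inputs local.
    target: Optional[Host] = next(
        (h for h in gpu_hosts if h.free_slots > 0), None)
    if target is None:
        return f"fall back to CPU enclave on {vm.host.name}"
    target.free_slots -= 1
    vm.host = target
    return f"migrated VM to {target.name}; run via backend stack (GPU)"


if __name__ == "__main__":
    cpu_host = Host("cpu-host-1", has_gpu=False, free_slots=4)
    gpu_host = Host("gpu-host-7", has_gpu=True, free_slots=2)
    vm = TenantVM(tenant="tenant-a", host=cpu_host)
    print(place_inference(vm, needs_gpu=False, gpu_hosts=[gpu_host]))
    print(place_inference(vm, needs_gpu=True, gpu_hosts=[gpu_host]))
```

The point the sketch captures is the locality inversion relative to the detached paradigm: instead of shipping inputs to a remote inference service over RPC, computation moves to the data (the CPU enclave) or the whole VM moves to the accelerator (the GPU host).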