Machine Learning at the Edge: Efficient Utilization of Limited CPU/GPU Resources by Multiplexing

Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan
{"title":"Machine Learning at the Edge: Efficient Utilization of Limited CPU/GPU Resources by Multiplexing","authors":"Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan","doi":"10.1109/ICNP49622.2020.9259361","DOIUrl":null,"url":null,"abstract":"Edge clouds can provide very responsive services for end-user devices that require more significant compute capabilities than they have. But edge cloud resources such as CPUs and accelerators such as GPUs are limited and must be shared across multiple concurrently running clients. However, multiplexing GPUs across applications is challenging. Further, edge servers are likely to require considerable amounts of streaming data to be processed. Getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work GPUs do. Finally, the lack of prompt notification of job completion from GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges in the following manner. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing of GPU can increase GPU utilization, the uncontrolled spatial sharing currently available with state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework utilizes controlled spatial sharing of GPU, which limits the interference across applications. Our framework uses the GPU DMA engine to offload data transfer to GPU, therefore preventing CPU from being bottleneck while transferring data from the network to GPU. Our framework uses the CUDA event library to have timely, low overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ~ 1.4.","PeriodicalId":233856,"journal":{"name":"2020 IEEE 28th International Conference on Network Protocols (ICNP)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 28th International Conference on Network Protocols (ICNP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNP49622.2020.9259361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Edge clouds can provide very responsive services for end-user devices that require greater compute capability than they possess. But edge cloud resources such as CPUs, and accelerators such as GPUs, are limited and must be shared across multiple concurrently running clients. However, multiplexing GPUs across applications is challenging. Further, edge servers are likely to require considerable amounts of streaming data to be processed. Getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work the GPUs can do. Finally, the lack of prompt notification of job completion from the GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges in the following manner. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing of the GPU can increase GPU utilization, the uncontrolled spatial sharing currently available with state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework utilizes controlled spatial sharing of the GPU, which limits the interference across applications. Our framework uses the GPU's DMA engine to offload data transfers to the GPU, preventing the CPU from becoming a bottleneck while transferring data from the network to the GPU. Our framework uses the CUDA event library to obtain timely, low-overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ~1.4.
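
The mechanisms named in the abstract (DMA-offloaded data transfer and event-based completion notification) map onto standard CUDA runtime facilities. The sketch below is not the authors' implementation; it is a minimal illustration, assuming a placeholder "infer" kernel and an arbitrary buffer size, of how pinned host memory with cudaMemcpyAsync lets the GPU's copy (DMA) engine move data without staging through the CPU, and how a CUDA event recorded after the work provides a prompt, low-overhead completion signal. Controlled spatial sharing itself is configured outside the program, for example via the CUDA MPS setting CUDA_MPS_ACTIVE_THREAD_PERCENTAGE on Volta-class GPUs, which caps the fraction of SMs a client may use.

// Minimal sketch (not the paper's framework): async DMA transfer from pinned
// host memory plus a CUDA event polled for low-overhead completion notice.
// The kernel "infer" and the buffer size are placeholders for illustration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void infer(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for real DNN inference work
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host buffers allow the GPU's DMA engine to copy
    // data directly, so the transfer does not consume CPU cycles.
    float *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;   // stand-in for network payload

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Event created without timing support to keep notification overhead low.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // Enqueue copy-in, kernel, copy-out; the copies are handled by the DMA engine.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    infer<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // Poll the event rather than synchronizing the whole device, so the CPU
    // learns promptly when this request finishes and can dispatch the next one.
    while (cudaEventQuery(done) == cudaErrorNotReady) { /* do other work */ }

    printf("first output element: %f\n", h_out[0]);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

Polling a per-request event, rather than calling cudaDeviceSynchronize, is what allows a completion signal to arrive without stalling work issued by other clients sharing the same GPU.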