Machine Learning at the Edge: Efficient Utilization of Limited CPU/GPU Resources by Multiplexing

Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan
{"title":"Machine Learning at the Edge: Efficient Utilization of Limited CPU/GPU Resources by Multiplexing","authors":"Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan","doi":"10.1109/ICNP49622.2020.9259361","DOIUrl":null,"url":null,"abstract":"Edge clouds can provide very responsive services for end-user devices that require more significant compute capabilities than they have. But edge cloud resources such as CPUs and accelerators such as GPUs are limited and must be shared across multiple concurrently running clients. However, multiplexing GPUs across applications is challenging. Further, edge servers are likely to require considerable amounts of streaming data to be processed. Getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work GPUs do. Finally, the lack of prompt notification of job completion from GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges in the following manner. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing of GPU can increase GPU utilization, the uncontrolled spatial sharing currently available with state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework utilizes controlled spatial sharing of GPU, which limits the interference across applications. Our framework uses the GPU DMA engine to offload data transfer to GPU, therefore preventing CPU from being bottleneck while transferring data from the network to GPU. Our framework uses the CUDA event library to have timely, low overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ~ 1.4.","PeriodicalId":233856,"journal":{"name":"2020 IEEE 28th International Conference on Network Protocols (ICNP)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 28th International Conference on Network Protocols (ICNP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNP49622.2020.9259361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Edge clouds can provide very responsive services for end-user devices that require greater compute capability than they possess. But edge cloud resources such as CPUs, and accelerators such as GPUs, are limited and must be shared across multiple concurrently running clients. However, multiplexing GPUs across applications is challenging. Further, edge servers are likely to require considerable amounts of streaming data to be processed. Getting that data from the network stream to the GPU can be a bottleneck, limiting the amount of work the GPUs can do. Finally, the lack of prompt notification of job completion from the GPU also results in ineffective GPU utilization. We propose a framework that addresses these challenges in the following manner. We utilize spatial sharing of GPUs to multiplex the GPU more efficiently. While spatial sharing of the GPU can increase GPU utilization, the uncontrolled spatial sharing currently available with state-of-the-art systems such as CUDA-MPS can cause interference between applications, resulting in unpredictable latency. Our framework utilizes controlled spatial sharing of the GPU, which limits the interference across applications. Our framework uses the GPU's DMA engine to offload data transfers to the GPU, preventing the CPU from becoming a bottleneck while transferring data from the network to the GPU. Our framework uses the CUDA event library to obtain timely, low-overhead GPU notifications. Preliminary experiments show that we can achieve low DNN inference latency and improve DNN inference throughput by a factor of ~1.4.
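
The mechanisms named in the abstract (DMA-offloaded data transfer and event-based completion notification) map onto standard CUDA runtime facilities. The sketch below is not the authors' implementation; it is a minimal illustration, assuming a placeholder "infer" kernel and an arbitrary buffer size, of how pinned host memory with cudaMemcpyAsync lets the GPU's copy (DMA) engine move data without staging through the CPU, and how a CUDA event recorded after the work provides a prompt, low-overhead completion signal. Controlled spatial sharing itself is configured outside the program, for example via the CUDA MPS setting CUDA_MPS_ACTIVE_THREAD_PERCENTAGE on Volta-class GPUs, which caps the fraction of SMs a client may use.

// Minimal sketch (not the paper's framework): async DMA transfer from pinned
// host memory plus a CUDA event polled for low-overhead completion notice.
// The kernel "infer" and the buffer size are placeholders for illustration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void infer(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for real DNN inference work
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host buffers allow the GPU's DMA engine to copy
    // data directly, so the transfer does not consume CPU cycles.
    float *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;   // stand-in for network payload

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Event created without timing support to keep notification overhead low.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // Enqueue copy-in, kernel, copy-out; the copies are handled by the DMA engine.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    infer<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // Poll the event rather than synchronizing the whole device, so the CPU
    // learns promptly when this request finishes and can dispatch the next one.
    while (cudaEventQuery(done) == cudaErrorNotReady) { /* do other work */ }

    printf("first output element: %f\n", h_out[0]);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

Polling a per-request event, rather than calling cudaDeviceSynchronize, is what allows a completion signal to arrive without stalling work issued by other clients sharing the same GPU.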