ECML:提高边缘云中的机器学习效率

2020 IEEE 9th International Conference on Cloud Networking (CloudNet) Pub Date : 2020-11-09 DOI:10.1109/CloudNet51028.2020.9335804

Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan

{"title":"ECML:提高边缘云中的机器学习效率","authors":"Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan","doi":"10.1109/CloudNet51028.2020.9335804","DOIUrl":null,"url":null,"abstract":"Edge cloud data centers (Edge) are deployed to provide responsive services to the end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process a lot of streaming and the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to expand to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each of the GPU in the cluster efficiently by removing interference between applications, resulting in much better, predictable, inference latency We also use each of the cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.","PeriodicalId":156419,"journal":{"name":"2020 IEEE 9th International Conference on Cloud Networking (CloudNet)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"ECML: Improving Efficiency of Machine Learning in Edge Clouds\",\"authors\":\"Aditya Dhakal, Sameer G. Kulkarni, K. Ramakrishnan\",\"doi\":\"10.1109/CloudNet51028.2020.9335804\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Edge cloud data centers (Edge) are deployed to provide responsive services to the end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process a lot of streaming and the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to expand to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each of the GPU in the cluster efficiently by removing interference between applications, resulting in much better, predictable, inference latency We also use each of the cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.\",\"PeriodicalId\":156419,\"journal\":{\"name\":\"2020 IEEE 9th International Conference on Cloud Networking (CloudNet)\",\"volume\":\"81 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 9th International Conference on Cloud Networking (CloudNet)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CloudNet51028.2020.9335804\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 9th International Conference on Cloud Networking (CloudNet)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CloudNet51028.2020.9335804","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

部署边缘云数据中心(Edge)，为最终用户提供响应式服务。Edge可以承载更强大的cpu和深度神经网络加速器(如gpu)，并可用于从需要更重要计算能力的终端用户设备上卸载任务。但是Edge资源也可能是有限的，并且必须在多个应用程序之间共享，这些应用程序同时处理来自多个客户端的请求。然而，跨应用程序复用gpu是具有挑战性的。随着边缘云服务器需要处理大量的流，以及多GPU系统的出现，将数据从网络传输到GPU可能是一个瓶颈，限制了GPU集群可以完成的工作量。缺少来自GPU的作业完成的及时通知也会导致GPU利用率低下。我们基于我们最近在单个GPU的受控空间共享方面的工作，扩展到支持多GPU系统，并提出了一个解决这些挑战的框架。与CUDA-MPS等系统目前可用的最先进的不受控制的空间共享不同，我们的受控空间共享方法通过消除应用程序之间的干扰，有效地使用集群中的每个GPU，从而产生更好的，可预测的推理延迟。我们还使用每个集群GPU的DMA引擎来卸载数据传输到GPU complex，从而防止CPU成为瓶颈。最后，我们的框架使用CUDA事件库来提供及时、低开销的GPU通知。我们的评估表明，我们可以实现低DNN推理延迟，并将DNN推理吞吐量提高至少2倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

ECML: Improving Efficiency of Machine Learning in Edge Clouds

Edge cloud data centers (Edge) are deployed to provide responsive services to the end-users. Edge can host more powerful CPUs and DNN accelerators such as GPUs and may be used for offloading tasks from end-user devices that require more significant compute capabilities. But Edge resources may also be limited and must be shared across multiple applications that process requests concurrently from several clients. However, multiplexing GPUs across applications is challenging. With edge cloud servers needing to process a lot of streaming and the advent of multi-GPU systems, getting that data from the network to the GPU can be a bottleneck, limiting the amount of work the GPU cluster can do. The lack of prompt notification of job completion from the GPU can also result in poor GPU utilization. We build on our recent work on controlled spatial sharing of a single GPU to expand to support multi-GPU systems and propose a framework that addresses these challenges. Unlike the state-of-the-art uncontrolled spatial sharing currently available with systems such as CUDA-MPS, our controlled spatial sharing approach uses each of the GPU in the cluster efficiently by removing interference between applications, resulting in much better, predictable, inference latency We also use each of the cluster GPU's DMA engines to offload data transfers to the GPU complex, thereby preventing the CPU from being the bottleneck. Finally, our framework uses the CUDA event library to give timely, low overhead GPU notifications. Our evaluations show we can achieve low DNN inference latency and improve DNN inference throughput by at least a factor of 2.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE 9th International Conference on Cloud Networking (CloudNet)

自引率

0.00%

发文量