Resource Aware GPU Scheduling in Kubernetes Infrastructure

PARMA-DITAM@HiPEAC Pub Date : 1900-01-01 DOI:10.4230/OASIcs.PARMA-DITAM.2021.4

Aggelos Ferikoglou, Dimosthenis Masouros, Achilleas Tzenetopoulos, S. Xydis, D. Soudris

Citations: 1

Abstract

A growing number of artificial intelligence inference workloads are deployed and executed in the cloud. To serve and manage the resulting computational demand, data center operators have provisioned their infrastructures with accelerators. For GPUs in particular, support for efficient management is lacking: state-of-the-art schedulers and orchestrators treat GPUs as ordinary compute resources, ignoring their unique characteristics and the properties of the applications that run on them. Combined with GPU over-provisioning, this leads to severe resource under-utilization. Prior work has addressed the problem by colocating applications on a single accelerator device, but because it is resource agnostic it fails to resolve both the under-utilization and the quality-of-service violations, especially for latency-critical applications. In this paper, we design a resource-aware GPU scheduling framework that efficiently colocates applications on the same GPU accelerator card. We integrate our solution with Kubernetes, one of the most widely used cloud orchestration frameworks. We show that, compared to state-of-the-art schedulers, our scheduler can achieve 58.8% lower 99th-percentile end-to-end job execution time, while delivering 52.5% higher GPU memory usage, 105.9% higher GPU utilization on average, and 44.4% lower energy consumption on average, for a variety of representative ML workloads.

2012 ACM Subject Classification: Computing methodologies; Computer systems organization → Cloud computing; Computer systems organization → Heterogeneous (hybrid) systems; Hardware → Emerging architectures
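
To make the idea of resource-aware colocation concrete, the following is a minimal sketch and not the authors' scheduler. It assumes each job declares a GPU memory requirement up front and uses a simple best-fit rule to pack jobs onto shared cards. The names `Gpu`, `place`, the memory figures, and the job list are all hypothetical and chosen only for illustration; the abstract does not describe the actual placement policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gpu:
    """Hypothetical view of one GPU card as seen by the scheduler."""
    name: str
    memory_mib: int                      # total device memory
    used_mib: int = 0                    # memory already promised to colocated jobs
    jobs: List[str] = field(default_factory=list)

    @property
    def free_mib(self) -> int:
        return self.memory_mib - self.used_mib


def place(job: str, need_mib: int, gpus: List[Gpu]) -> Optional[Gpu]:
    """Best-fit placement (an assumed policy): choose the card with the
    least free memory that can still hold the job, so jobs share cards
    instead of each claiming a whole GPU."""
    candidates = [g for g in gpus if g.free_mib >= need_mib]
    if not candidates:
        return None  # no card can host the job now; the caller may queue it
    target = min(candidates, key=lambda g: g.free_mib)
    target.used_mib += need_mib
    target.jobs.append(job)
    return target


# Illustrative cluster and workload (made-up numbers).
gpus = [Gpu("gpu-0", 16000), Gpu("gpu-1", 16000)]
for job, need in [("a", 6000), ("b", 5000), ("c", 7000), ("d", 12000)]:
    chosen = place(job, need, gpus)
    print(job, "->", chosen.name if chosen else "queued")
```

In this sketch jobs `a` and `b` end up sharing one card, `c` lands on the other, and `d` waits because no card has enough free memory. A resource-agnostic scheduler that hands out whole GPUs would instead leave most of each card's memory idle, which is the under-utilization the abstract describes.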
