Hybrid Resource Management for HPC and Data Intensive Workloads

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Pub Date: 2019-05-01, DOI: 10.1109/CCGRID.2019.00054

Abel Souza, Mohamad Rezaei, E. Laure, Johan Tordsson


Citations: 11

Abstract




High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed on separate clusters using different tools for resource and application management. With increasing convergence, where modern applications are composed of both types of jobs in complex workflows, this separation becomes a growing overhead and the need for a common platform increases. Executing both workload classes on the same clusters not only enables hybrid workflows but can also increase system efficiency, as available hardware is often not fully utilized by applications. While HPC systems are typically managed in a coarse-grained fashion, with exclusive resource allocations, DI systems employ a finer-grained regime, enabling dynamic allocation and control based on application needs. On the path to full convergence, a useful and less intrusive step is a hybrid resource management system that allows DI applications to execute on top of standard HPC scheduling systems. In this paper we present the architecture of a hybrid system enabling dual-level scheduling for DI jobs in HPC infrastructures. Our system takes advantage of real-time resource profiling to efficiently co-schedule HPC and DI applications. The architecture is easily extensible to current and new types of distributed applications, allowing efficient combination of hybrid workloads on HPC resources with increased job throughput and higher overall resource utilization. The implementation is based on the Slurm and Mesos resource managers for HPC and DI jobs, respectively. Experimental evaluations in a real cluster based on a set of representative HPC and DI applications demonstrate that our hybrid architecture improves resource utilization by 20%, with a 12% decrease in queue makespan, while still meeting all deadlines for HPC jobs.
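
The dual-level idea in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the paper's implementation: it does not call Slurm or Mesos, and the names `Node`, `allocate_hpc`, `place_di_task`, the `hpc_used_cores` field, and the one-core reserve are all hypothetical. The first level hands whole nodes to an HPC job exclusively; the second level places DI tasks into cores that a profiler reports as idle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    total_cores: int
    hpc_job: Optional[str] = None   # exclusive first-level allocation (hypothetical field)
    hpc_used_cores: int = 0         # cores the profiler reports the HPC job actually uses
    di_tasks: Dict[str, int] = field(default_factory=dict)  # DI task name -> cores

    def idle_cores(self, reserve: int = 1) -> int:
        """Cores a DI task may borrow: total minus profiled HPC use,
        minus cores already lent to DI tasks, minus a safety reserve."""
        used = self.hpc_used_cores + sum(self.di_tasks.values())
        return max(0, self.total_cores - used - reserve)


def allocate_hpc(nodes: List[Node], job: str, n_nodes: int) -> List[Node]:
    """First level (coarse-grained): give the job whole nodes, exclusively.
    Returns an empty list when not enough nodes are free."""
    free = [n for n in nodes if n.hpc_job is None]
    if len(free) < n_nodes:
        return []
    chosen = free[:n_nodes]
    for n in chosen:
        n.hpc_job = job
    return chosen


def place_di_task(nodes: List[Node], task: str, cores: int) -> Optional[Node]:
    """Second level (fine-grained): first-fit placement into profiled idle cores."""
    for n in nodes:
        if n.idle_cores() >= cores:
            n.di_tasks[task] = cores
            return n
    return None


if __name__ == "__main__":
    cluster = [Node("n0", 16), Node("n1", 16)]
    allocate_hpc(cluster, "mpi-sim", 1)   # n0 is allocated exclusively
    cluster[0].hpc_used_cores = 10        # assumed profiler sample: 10 of 16 cores busy
    node = place_di_task(cluster, "spark-stage", 4)
    print(node.name if node else "no placement")
```

The sketch deliberately leaves out everything that makes the real problem hard, such as reclaiming borrowed cores when the HPC job's usage rises, memory and I/O contention, and HPC job deadlines.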

