Medea: scheduling of long running applications in shared production clusters

Proceedings of the Thirteenth EuroSys Conference. Pub Date: 2018-04-23. DOI: 10.1145/3190508.3190549

Panagiotis Garefalakis, Konstantinos Karanasos, P. Pietzuch, Arun Suresh, Sriram Rao


Citations: 89

Abstract

The rise in popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications require precise control of their placements, by means of complex constraints, e.g., to collocate or separate their long-running containers across groups of nodes. In the presence of these applications, the cluster scheduler must attain global optimization objectives, such as maximizing the number of deployed applications or minimizing the violated constraints and the resource fragmentation, but without affecting the scheduling latency of short-running containers. We present Medea, a new cluster scheduler designed for the placement of long- and short-running containers. Medea introduces powerful placement constraints with formal semantics to capture interactions among containers within and across applications. It follows a novel two-scheduler design: (i) for long-running containers, it applies an optimization-based approach that accounts for constraints and global objectives; (ii) for short-running containers, it uses a traditional task-based scheduler for low placement latency. Evaluated on a 400-node cluster, our implementation of Medea on Apache Hadoop YARN achieves placement of long-running applications with significant performance and resilience benefits compared to state-of-the-art schedulers.
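As a rough illustration of the two-scheduler split described in the abstract, the sketch below routes long-running containers through a constraint-aware placer and short-running containers through a fast first-fit path. It is not Medea's implementation: every class, field, and function name here is hypothetical, and the greedy "fewest violations" rule is a simple stand-in for the optimization-based approach the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_slots: int
    tags: set = field(default_factory=set)  # tags of containers already placed here


@dataclass
class Container:
    name: str
    tag: str
    long_running: bool
    avoid_tags: set = field(default_factory=set)   # anti-affinity: tags to stay away from
    prefer_tags: set = field(default_factory=set)  # affinity: tags to be collocated with


def violations(container, node):
    """Count the constraints that placing the container on this node would violate."""
    anti_affinity = len(container.avoid_tags & node.tags)
    affinity = len(container.prefer_tags - node.tags)
    return anti_affinity + affinity


def place_short(container, nodes):
    """Fast path (assumed): first node with a free slot, constraints ignored."""
    for node in nodes:
        if node.free_slots > 0:
            return node
    return None


def place_long(container, nodes):
    """Constraint-aware path (assumed): the feasible node with the fewest violations.

    This greedy rule only approximates a global optimization; ties go to the
    earliest node in the list.
    """
    feasible = [n for n in nodes if n.free_slots > 0]
    if not feasible:
        return None
    return min(feasible, key=lambda n: violations(container, n))


def schedule(container, nodes):
    """Dispatch to one of the two placers and record the placement on the node."""
    placer = place_long if container.long_running else place_short
    node = placer(container, nodes)
    if node is not None:
        node.free_slots -= 1
        node.tags.add(container.tag)
    return node


if __name__ == "__main__":
    cluster = [Node("n1", 2), Node("n2", 2)]
    for c in [
        Container("hb1", "hbase", True, avoid_tags={"hbase"}),
        Container("hb2", "hbase", True, avoid_tags={"hbase"}),
        Container("t1", "task", False),
    ]:
        chosen = schedule(c, cluster)
        print(c.name, "->", chosen.name if chosen else "unplaced")
```

In this toy run the two containers tagged `hbase` are separated across nodes because each declares anti-affinity to its own tag, while the short-running task takes the first node with a free slot without consulting any constraints.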



