Dynamic QoS-Driven Framework for Co-Scheduling of Distributed Long-Running Applications on Shared Clusters

IF 5 | Zone 2, Computer Science | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Cloud Computing | Pub Date: 2025-03-16 | DOI: 10.1109/TCC.2025.3571098

Jianyong Zhu;Hongtao Wang;Pan Su;Yang Wang;Weihua Pan

IEEE Transactions on Cloud Computing, vol. 13, no. 3, pp. 837-853. Journal Article. https://ieeexplore.ieee.org/document/11006477/

Citations: 0

Abstract

Cloud service providers typically co-locate various workloads within the same production cluster to improve resource utilization and reduce operational costs. These workloads primarily consist of batch analysis jobs composed of multiple parallel short-running tasks and long-running applications (LRAs) that continuously reside in the system. The adoption of microservice architecture has led to the emergence of distributed LRAs (DLRAs), which enhance deployment flexibility but pose challenges in detecting and investigating QoS violations due to workload variability and performance propagation across microservices. State-of-the-art resource managers are only responsible for resource allocation among applications/jobs and do not prioritize runtime QoS aspects, such as application-level latency. To address this, we introduce Prank, a QoS-driven resource management framework for co-located workloads. Prank incorporates a non-intrusive performance anomaly detection mechanism for DLRAs and proposes a root cause localization algorithm based on PageRank-weighted analysis of performance anomalies. Moreover, it dynamically balances resource allocation between DLRAs and co-located batch jobs on nodes hosting critical microservices, optimizing for both DLRA performance and overall cluster efficiency. Experimental results demonstrate that Prank outperforms state-of-the-art baselines, reducing DLRA tail latency by over 38% while increasing batch job completion time by no more than 21% on average.
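The abstract does not describe how Prank's root cause localization is computed. Purely as an illustration of what a PageRank-weighted ranking over a microservice call graph could look like, the sketch below runs a personalized PageRank in which suspicion flows from callers to the services they depend on, and the teleport vector is proportional to each service's anomaly score. The function name `rank_root_causes`, the call graph, the anomaly scores, and the damping factor of 0.85 are all assumptions made for this example, not details taken from the paper.

```python
def rank_root_causes(calls, anomaly, damping=0.85, iters=100):
    """Rank services with an anomaly-personalized PageRank (illustrative only).

    calls:   dict mapping caller -> list of callees (hypothetical call graph)
    anomaly: dict mapping service -> non-negative anomaly score (hypothetical)
    """
    nodes = sorted(set(calls) | {c for cs in calls.values() for c in cs} | set(anomaly))
    total = sum(anomaly.get(n, 0.0) for n in nodes)
    # Teleport vector: proportional to anomaly score, uniform if nothing is anomalous.
    teleport = {
        n: (anomaly.get(n, 0.0) / total if total > 0 else 1.0 / len(nodes))
        for n in nodes
    }
    rank = dict(teleport)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            callees = calls.get(n, [])
            if callees:
                # Suspicion moves from a caller to the services it depends on.
                share = damping * rank[n] / len(callees)
                for c in callees:
                    nxt[c] += share
            else:
                # Leaf service with no callees: redistribute via the teleport vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical four-service application and made-up anomaly scores.
calls = {"frontend": ["cart", "search"], "cart": ["db"], "search": ["db"]}
anomaly = {"frontend": 0.6, "cart": 0.5, "search": 0.1, "db": 0.9}
ranking = rank_root_causes(calls, anomaly)
for service, score in ranking:
    print(f"{service}: {score:.3f}")
```

In this toy graph the shared dependency `db` ranks first, because it is both highly anomalous and downstream of every other anomalous service. A production system would also need the detection step and the resource rebalancing step that the abstract mentions, which this sketch does not attempt.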




Source journal

IEEE Transactions on Cloud Computing (Computer Science-Software)

CiteScore: 9.40
Self-citation rate: 6.20%
Articles published: 167

Journal description: The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.