Dynamic QoS-Driven Framework for Co-Scheduling of Distributed Long-Running Applications on Shared Clusters
Jianyong Zhu; Hongtao Wang; Pan Su; Yang Wang; Weihua Pan
IEEE Transactions on Cloud Computing, vol. 13, no. 3, pp. 837-853. Published 2025-03-16. DOI: 10.1109/TCC.2025.3571098
https://ieeexplore.ieee.org/document/11006477/
Impact factor: 5.0. JCR: Q1 (Computer Science, Information Systems).
Citations: 0
Abstract
Cloud service providers typically co-locate various workloads within the same production cluster to improve resource utilization and reduce operational costs. These workloads primarily consist of batch analysis jobs composed of multiple parallel short-running tasks and long-running applications (LRAs) that continuously reside in the system. The adoption of microservice architecture has led to the emergence of distributed LRAs (DLRAs), which enhance deployment flexibility but pose challenges in detecting and investigating QoS violations due to workload variability and performance propagation across microservices. State-of-the-art resource managers are only responsible for resource allocation among applications/jobs and do not prioritize runtime QoS aspects, such as application-level latency. To address this, we introduce Prank, a QoS-driven resource management framework for co-located workloads. Prank incorporates a non-intrusive performance anomaly detection mechanism for DLRAs and proposes a root cause localization algorithm based on PageRank-weighted analysis of performance anomalies. Moreover, it dynamically balances resource allocation between DLRAs and co-located batch jobs on nodes hosting critical microservices, optimizing for both DLRA performance and overall cluster efficiency. Experimental results demonstrate that Prank outperforms state-of-the-art baselines, reducing DLRA tail latency by over 38% while increasing batch job completion time by no more than 21% on average.
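The abstract does not give the details of Prank's root cause localization, so the following is only a minimal sketch of the general idea of PageRank-weighted anomaly ranking, not the authors' algorithm. It assumes a hypothetical call graph (`calls`, mapping each caller to its callees) and hypothetical per-service anomaly scores (`anomaly`). A random walk follows caller-to-callee edges, so rank accumulates at the dependencies that anomalous services rely on, and the teleport step is biased toward services with higher anomaly scores. All names and numbers here are illustrative.

```python
def rank_root_causes(calls, anomaly, damping=0.85, iters=100):
    """Rank services as root-cause candidates with personalized PageRank.

    calls:   dict mapping a caller to the list of services it calls (assumed input).
    anomaly: dict mapping a service to a non-negative anomaly score (assumed input).
    Returns a list of (service, score) pairs, highest score first.
    """
    services = sorted(
        set(anomaly) | set(calls) | {c for cs in calls.values() for c in cs}
    )
    total = sum(anomaly.get(s, 0.0) for s in services)
    if total > 0:
        # Teleport distribution proportional to anomaly scores.
        teleport = {s: anomaly.get(s, 0.0) / total for s in services}
    else:
        # No anomalies observed: fall back to a uniform distribution.
        teleport = {s: 1.0 / len(services) for s in services}

    rank = dict(teleport)
    for _ in range(iters):
        nxt = {s: (1.0 - damping) * teleport[s] for s in services}
        for s in services:
            callees = calls.get(s, [])
            if callees:
                # Pass rank from a caller to the services it depends on.
                share = damping * rank[s] / len(callees)
                for c in callees:
                    nxt[c] += share
            else:
                # Leaf service: redistribute its rank via the teleport distribution.
                for t in services:
                    nxt[t] += damping * rank[s] * teleport[t]
        rank = nxt
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical four-service application with made-up anomaly scores.
calls = {"frontend": ["cart", "search"], "cart": ["db"], "search": ["db"], "db": []}
anomaly = {"frontend": 0.6, "cart": 0.7, "search": 0.1, "db": 0.9}
ranking = rank_root_causes(calls, anomaly)
for service, score in ranking:
    print(f"{service}: {score:.3f}")
```

In this toy setup the shared database, which is both anomalous and depended on by every anomalous path, comes out on top. A real system would derive the graph and the scores from traces or metrics, which this sketch does not attempt.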
About the journal:
The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.