Adaptive CPU sharing for co-located latency-critical JVM applications and batch jobs under dynamic workloads

Impact Factor 6.2 · CAS Zone 2 (Computer Science) · JCR Q1 (Computer Science, Theory & Methods)
Dishi Xu, Fagui Liu, Bin Wang, Xuhao Tang, Qingbo Wu
Journal: Future Generation Computer Systems: The International Journal of eScience, Vol. 180, Article 108387
DOI: 10.1016/j.future.2026.108387
Published online: 2026-01-23 (issue date: 2026-07-01)
URL: https://www.sciencedirect.com/science/article/pii/S0167739X2600021X
Citations: 0

Abstract

Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit poor cold-start performance, so frequently deleting and creating application instances to adjust resource allocation degrades performance. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs faces serious challenges due to contention for shared CPU resources. We therefore present ChaosRM, a bi-level resource management framework for JVM workload co-location that improves resource utilization efficiency while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism: two CPU zones isolate JLRAs and batch jobs, respectively, and a shared zone concurrently executes threads from both. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns both the shared-zone allocation strategy and a performance target expressed as thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and by up to 54.29% across all baselines.
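The tri-zone mechanism above partitions each server's cores into a JLRA zone, a batch zone, and a shared zone that threads from both sides may use. The paper's implementation is not reproduced on this page; the sketch below is a minimal illustration (all function and type names are hypothetical) of how such a partition could be computed and rendered as Linux cpuset strings, with JLRA threads bound to their zone plus the shared zone and batch threads to the batch zone plus the shared zone.

```python
from dataclasses import dataclass

@dataclass
class TriZonePlan:
    """Hypothetical tri-zone CPU partition: dedicated JLRA cores,
    dedicated batch cores, and a shared zone both may use."""
    jlra: list[int]
    batch: list[int]
    shared: list[int]

def plan_tri_zone(total_cores: int, jlra_cores: int, batch_cores: int) -> TriZonePlan:
    """Split core IDs 0..total_cores-1 into three contiguous zones.
    Cores left over after the two dedicated zones form the shared zone,
    whose size the Application Manager would adjust at runtime."""
    if jlra_cores + batch_cores > total_cores:
        raise ValueError("dedicated zones exceed available cores")
    cores = list(range(total_cores))
    jlra = cores[:jlra_cores]
    batch = cores[jlra_cores:jlra_cores + batch_cores]
    shared = cores[jlra_cores + batch_cores:]
    return TriZonePlan(jlra, batch, shared)

def cpuset_str(cores: list[int]) -> str:
    """Render a sorted core list as a cpuset string, e.g. [0,1,2,5] -> '0-2,5'."""
    if not cores:
        return ""
    ranges = []
    start = prev = cores[0]
    for c in cores[1:]:
        if c == prev + 1:
            prev = c
        else:
            ranges.append((start, prev))
            start = prev = c
    ranges.append((start, prev))
    return ",".join(f"{a}-{b}" if a != b else f"{a}" for a, b in ranges)

# Example: 16 cores, 8 dedicated to JLRAs, 4 to batch jobs, 4 shared.
plan = plan_tri_zone(total_cores=16, jlra_cores=8, batch_cores=4)
print(cpuset_str(plan.jlra + plan.shared))   # cpuset for JLRA threads -> 0-7,12-15
print(cpuset_str(plan.batch + plan.shared))  # cpuset for batch threads -> 8-15
```

In a real deployment the emitted strings would be written to the containers' cpuset controllers (e.g. `cpuset.cpus` under cgroup v2); this sketch only covers the partition arithmetic, not the scheduling policy ChaosRM learns.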
Source journal metrics
CiteScore: 19.90
Self-citation rate: 2.70%
Articles per year: 376
Review time: 10.6 months
Journal profile: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.