Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job Model

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-11-07 DOI:10.1109/TPDS.2024.3493631

Jonatha Anselmi;Josu Doncel

{"title":"Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job Model","authors":"Jonatha Anselmi;Josu Doncel","doi":"10.1109/TPDS.2024.3493631","DOIUrl":null,"url":null,"abstract":"We present a new framework for designing nonpreemptive and job-size oblivious scheduling policies in the multiserver-job queueing model. The main requirement is to identify a \n<i>static and balanced sub-partition</i>\n of the server set and ensure that the servers in each set of that sub-partition can only handle jobs of a given \n<i>class</i>\n and in a first-come first-served order. A job class is determined by the number of servers to which it has exclusive access during its entire execution and the probability distribution of its service time. This approach aims to reduce delays by preventing small jobs from being blocked by larger ones that arrived first, and it is particularly beneficial when the job size variability intra resp. inter classes is small resp. large. In this setting, we propose a new scheduling policy, Balanced-Splitting. In our main results, we provide a sufficient condition for the stability of Balanced-Splitting and show that the resulting queueing probability, i.e., the probability that an arriving job needs to wait for processing upon arrival, vanishes in both the subcritical (the load is kept fixed to a constant less than one) and critical (the load approaches one from below) many-server limiting regimes. Crucial to our analysis is a connection with the M/GI/\n<inline-formula><tex-math>$s$</tex-math></inline-formula>\n/\n<inline-formula><tex-math>$s$</tex-math></inline-formula>\n queue and Erlang’s loss formula, which allows our analysis to rely on fundamental results from queueing theory. Numerical simulations show that the proposed policy performs better than several preemptive/nonpreemptive size-aware/oblivious policies in various practical scenarios. This is also confirmed by simulations running on real traces from High Performance Computing (HPC) workloads. The delays induced by Balanced-Splitting are also competitive with those induced by state-of-the-art policies such as First-Fit-SRPT and ServerFilling-SRPT, though our approach has the advantage of not requiring preemption, nor the knowledge of job sizes.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 1","pages":"43-54"},"PeriodicalIF":5.6000,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10746637/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

We present a new framework for designing nonpreemptive and job-size oblivious scheduling policies in the multiserver-job queueing model. The main requirement is to identify a static and balanced sub-partition of the server set and ensure that the servers in each set of that sub-partition can only handle jobs of a given class and in a first-come first-served order. A job class is determined by the number of servers to which it has exclusive access during its entire execution and the probability distribution of its service time. This approach aims to reduce delays by preventing small jobs from being blocked by larger ones that arrived first, and it is particularly beneficial when the job size variability intra resp. inter classes is small resp. large. In this setting, we propose a new scheduling policy, Balanced-Splitting. In our main results, we provide a sufficient condition for the stability of Balanced-Splitting and show that the resulting queueing probability, i.e., the probability that an arriving job needs to wait for processing upon arrival, vanishes in both the subcritical (the load is kept fixed to a constant less than one) and critical (the load approaches one from below) many-server limiting regimes. Crucial to our analysis is a connection with the M/GI/

$s$

queue and Erlang’s loss formula, which allows our analysis to rely on fundamental results from queueing theory. Numerical simulations show that the proposed policy performs better than several preemptive/nonpreemptive size-aware/oblivious policies in various practical scenarios. This is also confirmed by simulations running on real traces from High Performance Computing (HPC) workloads. The delays induced by Balanced-Splitting are also competitive with those induced by state-of-the-art policies such as First-Fit-SRPT and ServerFilling-SRPT, though our approach has the advantage of not requiring preemption, nor the knowledge of job sizes.

查看原文本刊更多论文

平衡拆分：在多服务器任务模型中实现零等待的框架

我们提出了一个新框架，用于在多服务器作业队列模型中设计非抢占式和作业大小忽略式调度策略。主要要求是确定服务器集的静态平衡子分区，并确保该子分区中的每一组服务器只能按先到先得的顺序处理给定类别的作业。作业类别由作业在整个执行过程中可独占访问的服务器数量及其服务时间的概率分布决定。这种方法的目的是防止小作业被先到的大作业阻塞，从而减少延迟。在这种情况下，我们提出了一种新的调度策略--平衡拆分。在我们的主要结果中，我们提供了平衡拆分法稳定性的充分条件，并证明了由此产生的排队概率，即到达的作业在到达后需要等待处理的概率，在亚临界（负载固定为小于 1 的常数）和临界（负载从下往上接近 1）多服务器极限状态下都会消失。我们的分析与 M/GI/$s$/$s$ 队列和 Erlang 损失公式之间的联系至关重要，这使得我们的分析可以依赖于队列理论的基本结果。数值模拟表明，在各种实际情况下，建议的策略比几种抢占式/非抢占式大小感知/盲目策略性能更好。在高性能计算（HPC）工作负载的真实轨迹上运行的仿真也证实了这一点。尽管我们的方法具有无需抢占、无需了解作业大小的优势，但平衡拆分引发的延迟与最先进的策略（如 First-Fit-SRPT 和 ServerFilling-SRPT）相比也具有竞争力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.