Adaptive work-stealing with parallelism feedback

IF 2.0; CAS Zone 4, Computer Science; JCR Q2, COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems, pp. 7:1-7:32. Pub Date: 2008-09-01. DOI: 10.1145/1394441.1394443

Kunal Agrawal, C. Leiserson, Yuxiong He, W. Hsu


Citations: 20

Abstract

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors.

We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority.

More precisely, suppose that a job has work T₁ and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T₁/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum, and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps except the O(T∞ + L lg P) time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P̃ ≪ T₁/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed mean dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal.

We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998] which does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP.
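The trimmed availability defined in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `trimmed_availability`, the sample profile, and the choice to return 0 when every step is trimmed are assumptions made for illustration only. Here `r` stands in for the number of trimmed steps, which the abstract gives only asymptotically as O(T∞ + L lg P).

```python
from typing import Sequence


def trimmed_availability(availability: Sequence[int], r: int) -> float:
    """Mean processor availability after discarding the r time steps
    with the highest availability.

    Assumption (not from the source): returns 0.0 if nothing remains.
    """
    if r < 0:
        raise ValueError("r must be non-negative")
    keep = max(len(availability) - r, 0)
    # Sorting ascending puts the r highest-availability steps at the end,
    # so slicing off the tail trims exactly those steps.
    kept = sorted(availability)[:keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical availability profile: the adversarial job scheduler offers
# 64 processors on two steps and only 2 processors on the other six.
profile = [2, 2, 64, 2, 2, 64, 2, 2]

plain_mean = sum(profile) / len(profile)      # 17.5
trimmed = trimmed_availability(profile, r=2)  # 2.0

print(f"plain mean availability:   {plain_mean}")
print(f"2-trimmed availability:    {trimmed}")
```

In this made-up profile the two spikes raise the plain mean to 17.5, while the 2-trimmed availability is 2.0. That gap is the reason the bound is stated in terms of P̃: a few steps on which the adversary offers many processors at an unhelpful moment do not count against the thread scheduler.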


Source journal

ACM Transactions on Computer Systems (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore

4.00

Self-citation rate

0.00%

Articles published

Review time

1 month

About the journal: ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized. TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.