Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

David Dinh, H. Simhadri, Yuan Tang
{"title":"Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers","authors":"David Dinh, H. Simhadri, Yuan Tang","doi":"10.1145/2935764.2935797","DOIUrl":null,"url":null,"abstract":"The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. \"||\" (parallel) and \";\" (serial), that comprise the nested-parallel model are insufficient in expressing \"partial dependencies\" in a program. We propose a new dataflow composition construct \"↝\" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e, Parallel Memory Hierarchies) and prove guarantees on their ability to balance nodes across processors and preserve locality. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased \"parallelizability\" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is O((∑i=0h-1 Q*(t;σ⋅ Mi)⋅ Ci)/p) on a p-processor machine, where Q* is the parallel cache complexity of task t, Ci is the cost of cache miss at level-i cache which is of size Mi, and σ∈(0,1) is a constant.","PeriodicalId":346939,"journal":{"name":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2935764.2935797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

Abstract

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, "||" (parallel) and ";" (serial), that comprise the nested-parallel model are insufficient to express "partial dependencies" in a program. We propose a new dataflow composition construct "↝" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and prove guarantees on their ability to balance nodes across processors and preserve locality. For this, we adapt space-bounded (SB) schedulers to the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\left(\sum_{i=0}^{h-1} Q^{*}(t;\sigma\cdot M_i)\cdot C_i\right)/p\right)$ on a $p$-processor machine, where $Q^{*}$ is the parallel cache complexity of task $t$, $C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$, and $\sigma\in(0,1)$ is a constant.
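To make the contrast concrete, the sketch below shows how a partial dependency differs from the NP model's all-or-nothing composition: a consumer that needs only one half of a producer's output is chained to just that half instead of waiting behind a full ";" barrier. This is a minimal C++ illustration using std::async futures, not the paper's "↝" construct or its runtime; the function names produce_half and consume_left are hypothetical.

```cpp
// Minimal sketch (not the paper's construct): a dataflow-style partial
// dependency expressed with futures. In the NP model, a consumer composed
// after "(produce_left || produce_right) ;" must wait for BOTH producers;
// here the consumer is chained only to the producer half it actually reads.
#include <future>
#include <iostream>
#include <vector>

// Hypothetical producer: fills one half of a larger logical output.
std::vector<int> produce_half(int base) {
    std::vector<int> v(4);
    for (int i = 0; i < 4; ++i) v[i] = base + i;
    return v;
}

// Hypothetical consumer: depends only on the left half.
int consume_left(const std::vector<int>& left) {
    int sum = 0;
    for (int x : left) sum += x;
    return sum;
}

int main() {
    // Two independent producers, as in "produce_left || produce_right".
    auto left  = std::async(std::launch::async, produce_half, 0);
    auto right = std::async(std::launch::async, produce_half, 100);

    // Dataflow-style edge ("left ↝ consume_left"): the consumer blocks only
    // on the left half, so it can run even if the right half is still busy.
    auto sum_left = std::async(std::launch::async,
                               [&left] { return consume_left(left.get()); });

    std::cout << "left-dependent result: " << sum_left.get() << "\n";
    std::cout << "right half size: " << right.get().size() << "\n";
    return 0;
}
```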
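To illustrate what the running-time bound sums up, the sketch below evaluates $\left(\sum_{i=0}^{h-1} Q^{*}(t;\sigma\cdot M_i)\cdot C_i\right)/p$ for a hypothetical two-level hierarchy. The cache sizes, miss costs, and the stand-in used for $Q^{*}$ (a matrix-multiply-like $n^3/\sqrt{M}$ term) are illustrative assumptions, not parameters or results from the paper.

```cpp
// Hypothetical evaluation of the bound (sum_{i=0}^{h-1} Q*(t; sigma*M_i) * C_i) / p.
// All numbers and the Q* stand-in are illustrative, not taken from the paper.
#include <cmath>
#include <cstdio>

int main() {
    const double n     = 4096.0;          // problem size (hypothetical)
    const double sigma = 0.5;             // the constant sigma in (0,1)
    const double p     = 64.0;            // number of processors
    const double M[]   = {32.0e3, 4.0e6}; // level-0 and level-1 cache sizes, in words
    const double C[]   = {10.0, 100.0};   // per-miss cost at each level, in cycles
    const int    h     = 2;               // number of cache levels

    double total = 0.0;
    for (int i = 0; i < h; ++i) {
        // Stand-in for Q*(t; sigma*M_i): a matrix-multiply-like cache complexity.
        double q_star = (n * n * n) / std::sqrt(sigma * M[i]);
        total += q_star * C[i];
    }
    std::printf("illustrative running-time bound: %.3e cycles\n", total / p);
    return 0;
}
```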