On-the-fly pipeline parallelism

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures. Pub Date: 2013-07-23. DOI: 10.1145/2486159.2486174

I. Lee, C. Leiserson, T. Schardl, Jim Sukha, Zhunping Zhang


Citations: 10

Abstract


On-the-fly pipeline parallelism

Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with $T_1$ work and $T_\infty$ span (critical-path length), Piper executes the computation on $P$ processors in $T_P \le T_1/P + O(T_\infty + \lg P)$ expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
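To make the work and span quantities in the bound concrete, the following Python sketch computes them for a small pipeline whose number of stages varies from iteration to iteration. It is an illustrative model only, not the Piper scheduler or the Cilk-P interface. The names `work_and_span` and `time_bound`, the constant `c`, and the dependency rule (a "serial" stage waits for the same-numbered stage of the previous iteration only when that stage exists) are assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (iteration index, stage index)


def work_and_span(costs: List[List[float]],
                  serial: List[List[bool]]) -> Tuple[float, float]:
    """Return (work, span) of a pipeline dag under a simplified model.

    costs[i][j]  : cost of stage j in iteration i (rows may differ in length,
                   so the number of stages varies per iteration).
    serial[i][j] : if True, stage j of iteration i also waits for stage j of
                   iteration i-1, when that stage exists (assumed rule).
    """
    finish: Dict[Node, float] = {}  # earliest finish time of each node
    work = 0.0
    for i, row in enumerate(costs):
        for j, cost in enumerate(row):
            # Stage edge: a stage follows the previous stage of its iteration.
            start = finish[(i, j - 1)] if j > 0 else 0.0
            # Cross edge: optional dependency on the previous iteration.
            if serial[i][j] and (i - 1, j) in finish:
                start = max(start, finish[(i - 1, j)])
            finish[(i, j)] = start + cost
            work += cost
    span = max(finish.values(), default=0.0)
    return work, span


def time_bound(work: float, span: float, p: int, c: float = 1.0) -> float:
    """Evaluate T1/P + c * (T_inf + lg P); c is a hypothetical constant
    standing in for the one hidden by the O-notation."""
    return work / p + c * (span + math.log2(p))


# Three iterations with 3, 2, and 4 stages respectively.
example_costs = [[1, 4, 1], [1, 4], [1, 4, 1, 2]]
example_serial = [[True, False, True],
                  [True, False],
                  [True, False, True, True]]

t1, t_inf = work_and_span(example_costs, example_serial)
print("work:", t1, "span:", t_inf, "parallelism:", t1 / t_inf)
print("bound on 4 processors (c = 1):", time_bound(t1, t_inf, 4))
```

The ratio of work to span is the parallelism of the computation. When it is much larger than the processor count, the $T_1/P$ term dominates the bound, which is the regime where near-linear speedup is expected.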

