Automatic Hierarchical Parallelization of Linear Recurrences

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
Pub Date: 2018-03-19
DOI: 10.1145/3173162.3173168

Sepideh Maleki, Martin Burtscher

Citations: 7

Abstract

Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-efficient algorithm to compute linear recurrences that is amenable to automatic parallelization and suitable for hierarchical massively-parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of memory copy for large inputs, which cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and digital filters, our automatically parallelized code outperforms the fastest prior implementations.
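To make the dependence problem concrete, the following is a minimal sketch of a first-order linear recurrence, y[i] = a*x[i] + b*y[i-1], together with a simple two-phase chunked evaluation. It is an illustration only and not the algorithm or the code generator described in the paper: the function names, the `chunk` parameter, and the two-phase scheme are assumptions chosen for the example. With a = b = 1 the recurrence is a standard prefix sum; other coefficients give a single-stage IIR filter.

```python
def serial_recurrence(x, a, b):
    """Compute y[i] = a*x[i] + b*y[i-1] with y[-1] = 0, one element at a time."""
    y = []
    prev = 0
    for v in x:
        prev = a * v + b * prev  # each result needs the previous result
        y.append(prev)
    return y


def chunked_recurrence(x, a, b, chunk=4):
    """Hypothetical two-phase evaluation that yields the same values as serial_recurrence."""
    # Phase 1: solve every chunk as if its carry-in were zero.
    # The chunks do not depend on each other, so this phase could run in parallel.
    locals_ = [serial_recurrence(x[s:s + chunk], a, b)
               for s in range(0, len(x), chunk)]

    # Phase 2: pass the carry from chunk to chunk.
    # By linearity, element k of a chunk is off by b**(k+1) * carry.
    out = []
    carry = 0
    for local in locals_:
        factor = b
        fixed = []
        for v in local:
            fixed.append(v + factor * carry)
            factor *= b
        out.extend(fixed)
        carry = fixed[-1]  # last corrected value feeds the next chunk
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(serial_recurrence(data, 1, 1))      # prefix sum: [1, 3, 6, 10, 15]
    print(chunked_recurrence(data, 1, 1, 2))  # same values, computed in chunks
```

Only the carry hand-off between chunks is inherently sequential here; the per-chunk work and the per-element corrections are independent once the carry is known. On a hierarchical machine the same split could in principle be repeated at each level, which is the general idea the abstract refers to, though the paper's actual scheme may differ.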
