Efficient execution of recursive programs on commodity vector hardware

Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation. Pub Date: 2015-06-03. DOI: 10.1145/2737924.2738004

Bin Ren, Youngjoon Jo, S. Krishnamoorthy, Kunal Agrawal, Milind Kulkarni


Citations: 24

Abstract

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units.
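As a rough illustration of the idea in the abstract, and not the authors' actual transformation, the sketch below contrasts a plain recursive function with a breadth-first version that processes a whole block of tasks per recursion level. The function names `fib_recursive` and `fib_blocked` and the choice of Fibonacci as the workload are assumptions made for this example only. The inner loop applies the same operation to every task in the block, which is the data-parallel shape that vector units can exploit.

```python
def fib_recursive(n):
    """Plain divide-and-conquer form: each task spawns two subtasks."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_blocked(n):
    """Hypothetical breadth-first form: handle one level of tasks at a time."""
    block = [n]   # all pending tasks at the current recursion level
    total = 0     # running sum of base-case results
    while block:
        next_block = []
        # Each iteration does identical work on a different task, so this
        # loop is a candidate for vectorization in a compiled language.
        for task in block:
            if task < 2:
                total += task                 # base case contributes its value
            else:
                next_block.append(task - 1)   # spawn the two subtasks
                next_block.append(task - 2)
        block = next_block
    return total


if __name__ == "__main__":
    print(fib_blocked(10), fib_recursive(10))
```

In this sketch the block can grow at each level, so memory use rises quickly for larger inputs. That is the kind of space pressure the abstract's scheduling policies are meant to bound while keeping the vector lanes busy.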

