Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance

Proceedings of the Second Workshop on Optimizing Stencil Computations. Pub Date: 2014-10-20. DOI: 10.1145/2686745.2686752

Yuan Tang, R. You, Haibin Kan, Jesmin Jahan Tithi, P. Ganapathi, R. Chowdhury


Citations: 9

Abstract



The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm called "cache-oblivious wavefront" (COW) that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, thus reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. We provide experimental measurements of absolute running times, burdened span by Cilkview, and L1/L2 cache misses by PAPI to validate our claims.
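The difference between artificial and real dependencies can be illustrated with a toy model. The sketch below is not the trapezoidal decomposition or the COW algorithm from the paper; it assumes a simplified n x n grid of unit-cost tiles in which tile (i, j) truly depends only on tiles (i-1, j) and (i, j-1), and it compares the critical-path length (span) of a quadrant-by-quadrant recursive schedule with that of a schedule that starts each tile as soon as its real predecessors finish. The function names `dac_span` and `wavefront_span` are hypothetical.

```python
def dac_span(n):
    """Span, in unit tiles, of a recursive divide-and-conquer schedule on an
    n x n tile grid (n a power of two): top-left quadrant first, then the two
    off-diagonal quadrants in parallel, then the bottom-right quadrant."""
    if n == 1:
        return 1
    sub = dac_span(n // 2)
    # Three sequential phases; the middle phase runs two quadrants in
    # parallel, so it costs only one sub-span.
    return sub + sub + sub


def wavefront_span(n):
    """Span when every tile starts as soon as its real predecessors,
    (i-1, j) and (i, j-1), have finished."""
    finish = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            up = finish[i - 1][j] if i > 0 else 0
            left = finish[i][j - 1] if j > 0 else 0
            finish[i][j] = 1 + max(up, left)
    return finish[n - 1][n - 1]


if __name__ == "__main__":
    print(dac_span(8), wavefront_span(8))  # 27 15
```

In this toy model the recursive schedule makes a whole quadrant wait for another whole quadrant, so its span grows as n^(log2 3), while the dependency-driven schedule has span 2n - 1. Both schedules execute the same set of tiles, which mirrors the abstract's point that starting subtasks earlier need not change the task decomposition itself.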
