Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance

Yuan Tang, R. You, Haibin Kan, Jesmin Jahan Tithi, P. Ganapathi, R. Chowdhury
Proceedings of the Second Workshop on Optimizing Stencil Computations, 2014-10-20
DOI: 10.1145/2686745.2686752 (https://doi.org/10.1145/2686745.2686752)
Citations: 9

Abstract

The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. The same DAC approach, however, restricts parallelism: it introduces artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm, called "cache-oblivious wavefront" (COW), that starts recursive subtasks earlier than the original algorithm prescribes without violating any real dependency implied by the underlying recurrences, thereby reducing the serialization caused by artificial dependencies. The reduced serialization improves parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. To validate these claims, we provide experimental measurements of absolute running times, burdened span as reported by Cilkview, and L1/L2 cache misses as reported by PAPI.
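
To make the recursive decomposition concrete, here is a minimal serial sketch of a one-dimensional trapezoidal walk in the style of the classic cache-oblivious stencil algorithm of Frigo and Strumpen. It is not the COW algorithm and not the authors' code. The three-point averaging kernel, the zero boundary values, the two-row storage, and all function names are assumptions chosen for illustration.

```python
def kernel(u, t, x, n):
    """Compute the value at time t + 1 from the three neighbours at time t."""
    src, dst = u[t % 2], u[(t + 1) % 2]        # two rows, reused modulo 2
    left = src[x - 1] if x > 0 else 0.0        # zero boundary (assumption)
    right = src[x + 1] if x < n - 1 else 0.0
    dst[x] = 0.25 * left + 0.5 * src[x] + 0.25 * right


def walk1(u, n, t0, t1, x0, dx0, x1, dx1, ds=1):
    """Visit the trapezoid with time range [t0, t1), bottom edge [x0, x1),
    and side slopes dx0 and dx1. ds is the stencil slope (1 for 3 points)."""
    dt = t1 - t0
    if dt == 1:
        # Base case: a single time step over the bottom edge.
        for x in range(x0, x1):
            kernel(u, t0, x, n)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * ds * dt:
            # Space cut: the trapezoid is wide, so split it along a line of
            # slope -ds. The left part depends only on itself and runs first.
            xm = (2 * (x0 + x1) + (2 * ds + dx0 + dx1) * dt) // 4
            walk1(u, n, t0, t1, x0, dx0, xm, -ds, ds)
            walk1(u, n, t0, t1, xm, -ds, x1, dx1, ds)
        else:
            # Time cut: the whole upper half waits for the whole lower half,
            # which is stricter than the stencil itself requires.
            s = dt // 2
            walk1(u, n, t0, t0 + s, x0, dx0, x1, dx1, ds)
            walk1(u, n, t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1, ds)


def trapezoid_stencil(initial, steps):
    """Run `steps` time steps with the recursive trapezoidal order."""
    n = len(initial)
    u = [list(initial), [0.0] * n]
    walk1(u, n, 0, steps, 0, 0, n, 0)
    return u[steps % 2]


def naive_stencil(initial, steps):
    """Reference: plain loop over time, then space."""
    n = len(initial)
    u = [list(initial), [0.0] * n]
    for t in range(steps):
        for x in range(n):
            kernel(u, t, x, n)
    return u[steps % 2]
```

Both functions apply the same kernel to every space-time point and differ only in visiting order, so `trapezoid_stencil(data, steps)` should match `naive_stencil(data, steps)` for any input list. On one reading of the abstract, the strictly ordered recursive calls in the two cut cases are the kind of coarse-grained ordering that gives rise to artificial dependencies.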