Locality-aware task management for unstructured parallelism: a quantitative limit study

Richard M. Yoo, C. Hughes, Changkyu Kim, Yen-kuang Chen, C. Kozyrakis
DOI: 10.1145/2486159.2486175
Venue: Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Published: 2013-07-23
Citations: 38

Abstract

As we increase the number of cores on a processor die, the on-chip cache hierarchies that support these cores are getting larger, deeper, and more complex. As a result, non-uniform memory access effects are now prevalent even on a single chip. To reduce execution time and energy consumption, data access locality should be exploited. This is especially important for task-based programming systems, where a scheduler decides when and where on the chip the code segments, i.e., tasks, should execute. Capturing locality for structured task parallelism has been done effectively, but the more difficult case, unstructured parallelism, remains largely unsolved - little quantitative analysis exists to demonstrate the potential of locality-aware scheduling, and to guide future scheduler implementations in the most fruitful direction. This paper quantifies the potential of locality-aware scheduling for unstructured parallelism on three different many-core processors. Our simulation results of 32-core systems show that locality-aware scheduling can bring up to 2.39x speedup over a randomized schedule, and 2.05x speedup over a state-of-the-art baseline scheduling scheme. At the same time, a locality-aware schedule reduces average energy consumption by 55% and 47%, relative to the random and the baseline schedule, respectively. In addition, our 1024-core simulation results project that these benefits will only increase: Compared to 32-core executions, we see up to 1.83x additional locality benefits. To capture such potentials in a practical setting, we also perform a detailed scheduler design space exploration to quantify the impact of different scheduling decisions. We also highlight the importance of locality-aware stealing, and demonstrate that a stealing scheme can exploit significant locality while performing load balancing. Over randomized stealing, our proposed scheme shows up to 2.0x speedup for stolen tasks.
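The two ideas the abstract highlights, placing a task near the core whose cache likely holds its data, and stealing work with a preference for nearby victims, can be sketched in a few lines. The class and method names below are illustrative assumptions, not the paper's actual scheduler; it models cores grouped into tiles that share a cache.

```python
from collections import deque

class LocalityAwareScheduler:
    """Illustrative sketch of locality-aware task placement and stealing.
    Cores are grouped into tiles; cores on the same tile share a cache."""

    def __init__(self, num_cores, cores_per_tile):
        self.num_cores = num_cores
        self.cores_per_tile = cores_per_tile
        # One task queue per core.
        self.queues = [deque() for _ in range(num_cores)]

    def submit(self, task, home_core):
        # Locality-aware placement: enqueue the task on the core whose
        # cache is most likely to hold its data (its "home" core).
        self.queues[home_core].append(task)

    def pop(self, core):
        # A core first runs tasks from its own queue (best locality),
        # and steals only when that queue is empty.
        if self.queues[core]:
            return self.queues[core].pop()
        return self.steal(core)

    def steal(self, thief):
        # Locality-aware stealing: prefer victims on the same tile
        # (shared cache) before falling back to remote tiles.
        tile = thief // self.cores_per_tile
        same_tile = [c for c in range(self.num_cores)
                     if c != thief and c // self.cores_per_tile == tile]
        remote = [c for c in range(self.num_cores)
                  if c // self.cores_per_tile != tile]
        for victim in same_tile + remote:
            if self.queues[victim]:
                # Steal from the front: the oldest task's data is the
                # least likely to still be in the victim's cache.
                return self.queues[victim].popleft()
        return None  # No work anywhere: the thief idles.
```

For example, with 8 cores and 4 cores per tile, an idle core 4 would steal a task homed on core 5 (same tile) before reaching for work queued on cores 0 to 3. The abstract's reported 2.0x speedup for stolen tasks comes from exactly this kind of victim preference, though the paper's design space exploration covers far more policy choices than this sketch.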