Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs

2010 39th International Conference on Parallel Processing. Pub Date: 2010-09-13. DOI: 10.1109/ICPP.2010.13

Peng Di, Qing Wan, Xuemeng Zhang, Hui Wu, Jingling Xue


Citations: 12

Abstract

To exploit the full potential of GPGPUs for general-purpose computing, the DOACROSS parallelism abundant in scientific and engineering applications must be harnessed. However, the cross-iteration data dependences in DOACROSS loops are an obstacle to executing their computations concurrently with a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACROSS parallelism to identify optimization principles and strategies that allow their efficient mapping to GPGPUs. Our main finding is that certain DOACROSS loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned (by a performance-tuning tool). We substantiate this finding with a case study, presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally: we start from a K-layer SSOR method and parallelize it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations, and performance tuning in maximizing the performance of applications, particularly PDE-based DOACROSS loops, on GPGPUs.
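To make the cross-iteration dependence concrete, the following is a minimal sketch of the red-black SOR baseline that the abstract compares against. It is the textbook method, not the paper's new SSOR scheme. The model problem (the 5-point discretization of -Δu = 1 on the unit square with zero boundary values), the grid size, the relaxation factor, and all function names are assumptions chosen for illustration. In the lexicographic sweep, point (i, j) reads values just written at (i-1, j) and (i, j-1), which is the DOACROSS pattern. In the red-black sweep, every point of one color reads only points of the other color, so each color pass could be executed as a data-parallel loop.

```python
# Illustrative sketch only: textbook SOR on an assumed model problem,
# not the K-layer SSOR method proposed in the paper.

def relax(u, f, h, omega, i, j):
    """Over-relaxed update of one interior point of the 5-point stencil."""
    gauss_seidel = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1]
                           + u[i][j + 1] + h * h * f[i][j])
    u[i][j] += omega * (gauss_seidel - u[i][j])


def lexicographic_sweep(u, f, h, omega):
    """DOACROSS order: (i, j) uses values updated earlier in this same sweep."""
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            relax(u, f, h, omega, i, j)


def red_black_sweep(u, f, h, omega):
    """Two passes; points of one color are mutually independent."""
    n = len(u)
    for color in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == color:
                    relax(u, f, h, omega, i, j)


def residual(u, f, h):
    """Max-norm residual of the discrete equations over interior points."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                   - u[i][j - 1] - u[i][j + 1]) / (h * h)
            worst = max(worst, abs(f[i][j] - lap))
    return worst


def solve(sweep, n=17, omega=1.5, sweeps=200):
    """Run a fixed number of sweeps from a zero initial guess."""
    h = 1.0 / (n - 1)
    u = [[0.0] * n for _ in range(n)]   # boundary rows and columns stay zero
    f = [[1.0] * n for _ in range(n)]   # assumed right-hand side
    for _ in range(sweeps):
        sweep(u, f, h, omega)
    return u, residual(u, f, h)
```

Calling `solve(lexicographic_sweep)` and `solve(red_black_sweep)` should drive both orderings toward the same discrete solution. The difference that matters on a GPGPU is that the body of each color pass in `red_black_sweep` has no dependences between its iterations, whereas the lexicographic loop nest does.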


