A Hierarchical Tridiagonal System Solver for Heterogenous Supercomputers

2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems Pub Date : 2014-11-16 DOI:10.1109/ScalA.2014.12

Xinliang Wang, Yangtong Xu, Wei Xue

Citations: 8

Abstract

Tridiagonal system solvers are an important kernel in many scientific and engineering applications. Although quite a few parallel algorithms and implementations have been proposed in recent years, challenges remain when solving large-scale tridiagonal systems on heterogeneous supercomputers. In this paper, a hierarchical algorithm framework, SPIKE² (pronounced "SPIKE squared"), is proposed to minimize the parallel overhead and to achieve the best utilization of CPU-GPU hybrid systems. For these systems, a layered and adaptive partitioning based on the SPIKE algorithm is presented; it keeps the sequential parts under control while exploiting the overlap of computation and communication within a heterogeneous computing node. Moreover, the SPIKE algorithm is reformulated so that the matrix computations in our hierarchical framework are reduced to only 1/3. An improved implementation of the tiled-PCR-pThomas algorithm is employed on the GPU, and careful dependence analysis of the unit-vector tridiagonal systems reduces the shared memory usage on the GPU by 1/3. Our experiments on Tianhe-1A show ideal weak scalability on up to 128 nodes, with a tridiagonal system of size 1920M in the largest run, and good strong scalability (70%) from 32 to 256 nodes for a tridiagonal system of size 480M. Furthermore, adaptive task partitioning across the CPU and GPU yields over 10% performance improvement in the strong scaling test with 256 nodes. On one computing node of Tianhe-1A, our GPU-only code outperforms the cuSPARSE version (non-pivoting tridiagonal solver) by 30%, and our hybrid code is about 6.7 times faster than the Intel SPIKE multi-process version for tridiagonal systems of size 3M, 5M, and 15M.
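The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of the basic SPIKE idea it builds on, under several assumptions: the system is diagonally dominant (no pivoting), there are just two partitions, and everything runs sequentially on the CPU. The function names `thomas` and `spike_two_partitions` are illustrative and do not come from the paper. The sketch does not reproduce the hierarchical SPIKE² framework, the tiled-PCR-pThomas GPU kernel, or the CPU-GPU task partitioning. Note that the two "spike" solves use a scaled unit vector as the right-hand side, which is presumably the kind of unit-vector system the abstract refers to.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    a: sub-diagonal, b: main diagonal, c: super-diagonal, d: right-hand side.
    All four lists have length n; a[0] and c[-1] do not affect the result.
    """
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def spike_two_partitions(a, b, c, d, k):
    """Solve the same system by splitting the rows into [0, k) and [k, n).

    Assumes 1 <= k <= n - 1. The four local solves are independent of each
    other, which is what makes the scheme attractive for parallel hardware.
    """
    n = len(b)
    top, bot = slice(0, k), slice(k, n)
    # 1. Local solves that ignore the coupling between the two blocks
    g1 = thomas(a[top], b[top], c[top], d[top])
    g2 = thomas(a[bot], b[bot], c[bot], d[bot])
    # 2. Spikes: response of each block to its single coupling entry
    rhs_v = [0.0] * (k - 1) + [c[k - 1]]
    rhs_w = [a[k]] + [0.0] * (n - k - 1)
    v = thomas(a[top], b[top], c[top], rhs_v)
    w = thomas(a[bot], b[bot], c[bot], rhs_w)
    # 3. Reduced 2x2 system for the two unknowns next to the interface
    det = 1.0 - v[-1] * w[0]
    x1_last = (g1[-1] - v[-1] * g2[0]) / det
    x2_first = (g2[0] - w[0] * g1[-1]) / det
    # 4. Retrieve the full solution from the local solves and the spikes
    x1 = [g - s * x2_first for g, s in zip(g1, v)]
    x2 = [g - s * x1_last for g, s in zip(g2, w)]
    return x1 + x2


# Small diagonally dominant test system (values chosen only for illustration)
n = 8
a = [0.0] + [-1.0] * (n - 1)
b = [4.0] * n
c = [-1.0] * (n - 1) + [0.0]
d = [float(i + 1) for i in range(n)]

x_ref = thomas(a, b, c, d)
x_spike = spike_two_partitions(a, b, c, d, k=n // 2)
max_diff = max(abs(p - q) for p, q in zip(x_ref, x_spike))
print("max difference between the two solvers:", max_diff)
```

With more than two partitions the reduced system itself becomes a small banded system, and solving it is the sequential part that a layered partitioning has to keep small.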
