International Journal of High Performance Computing Applications: Latest Articles (Page 5)

Compressed basis GMRES on high-performance graphics processing units

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-08-05 DOI: 10.1177/10943420221115140

J. Aliaga, H. Anzt, Thomas Grützmacher, E. S. Quintana‐Ortí, A. Tomás

Abstract: Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in current computer architectures, motivating the investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This article leverages Ginkgo’s memory accessor in order to integrate a communication-reduction strategy into the (Krylov) GMRES solver that decouples the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory accesses, the cost of the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a decrease in the volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating-point as well as fixed-point formats with mild impact on the convergence of the iterative process. We develop a high-performance implementation of the “compressed basis GMRES” solver in the Ginkgo sparse linear algebra library using a large set of test problems from the SuiteSparse Matrix Collection. We demonstrate robustness and performance advantages on a modern NVIDIA V100 graphics processing unit (GPU) of up to 50% over the standard GMRES solver that stores all data in IEEE double-precision.

Volume 37, pages 82–100.
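The central idea in the abstract above, keeping the basis in a compact storage format while performing the arithmetic in double precision, can be illustrated with a minimal sketch. It uses only Python's standard `array` module and is not Ginkgo's memory accessor; the helper names `compress` and `dot_in_double` are hypothetical and chosen for illustration only.

```python
from array import array
import math


def compress(vec):
    """Store a vector in IEEE single precision (typically 4 bytes per entry)."""
    return array("f", vec)


def dot_in_double(stored, other):
    """Read the compact storage, but multiply and accumulate in double."""
    # Python floats are IEEE doubles, so each stored value is widened on read.
    return math.fsum(float(a) * b for a, b in zip(stored, other))


# A unit-norm basis vector: every entry is bounded by 1 in magnitude,
# which is what makes reduced-precision storage attractive.
v = [0.6, 0.8, 0.0]
v_stored = compress(v)            # compact storage format
v_double = array("d", v)          # reference double-precision storage
w = [1.0, 2.0, 3.0]

h = dot_in_double(v_stored, w)    # arithmetic precision stays double
print(v_stored.itemsize, v_double.itemsize, h)
```

The stored vector occupies half the bytes of its double-precision counterpart, so a memory-bound kernel reads less data, at the price of a small rounding error in each stored entry.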

Citations: 1

Corrigendum to ‘Unprecedented cloud resolution in a GPU-enabled full-physics atmospheric climate simulation on OLCF’s Summit supercomputer’

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-07-01 DOI: 10.1177/10943420221103014

M. Norman

Citations: 0

Large-scale direct numerical simulations of turbulence using GPUs and modern Fortran

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-06-23 DOI: 10.1177/10943420231158616

Martin Karp, D. Massaro, Niclas Jansson, A. Hart, Jacob Wahlgren, P. Schlatter, S. Markidis

Citations: 3

Accelerating physics simulations with tensor processing units: An inundation modeling example

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-06-03 DOI: 10.1177/10943420221102873

R. Hu, D. Pierce, Yusef Shafi, Anudhyan Boral, V. Anisimov, Sella Nevo, Yi-Fan Chen

Citations: 6

Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-05-24 DOI: 10.1177/10943420231177631

Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, T. Kühne, Christian Plessl

Citations: 4

Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-05-18 DOI: 10.1177/10943420221107880

M. Kronbichler, D. Sashko, Peter Munch

Citations: 13

Performance analysis of relaxation Runge–Kutta methods

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-05-12 DOI: 10.1177/10943420221085947

M. Rogowski, Lisandro Dalcin, M. Parsani, D. Keyes

Abstract: Recently, global and local relaxation Runge–Kutta methods have been developed for guaranteeing the conservation, dissipation, or other solution properties for general convex functionals whose dynamics are crucial for an ordinary differential equation solution. These novel time integration procedures have an application in a wide range of problems that require dynamics-consistent and stable numerical methods. The application of a relaxation scheme involves solving scalar nonlinear algebraic equations to find the relaxation parameter. Even though root-finding may seem technically straightforward and computationally insignificant, we address the problem at scale as we solve full-scale industrial problems on a CPU-powered supercomputer and show its cost to be considerable. In particular, we apply the relaxation schemes in the context of the compressible Navier–Stokes equations and use them to enforce the correct entropy evolution. We use seven different algorithms to solve for the global and local relaxation parameters and analyze their strong scalability. As a result of this analysis, for the global relaxation scheme we recommend Brent’s method for problems with a low polynomial degree and small sizes, while secant proves to be the best choice for higher polynomial degree solutions and large problem sizes. For the local relaxation scheme, we recommend secant. Further, we compare the schemes’ performance using their most efficient implementations, where we look at their effect on the timestep size, overhead, and weak scalability. We show the global relaxation scheme to be always more expensive than the local approach, typically 1.1–1.5 times the cost. At the same time, we highlight scenarios where the global relaxation scheme might underperform due to its increased communication requirements. Finally, we present an analysis that sets expectations on the computational overhead anticipated based on the system properties.

Volume 36, pages 524–542.
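The abstract above notes that each relaxation step requires solving a scalar nonlinear equation for the relaxation parameter, and that the secant method is among the recommended root finders. A minimal sketch of a secant iteration follows. The residual `toy_residual` is a hypothetical quadratic chosen only so that the nontrivial root (0.98) is known in advance; it is not the entropy residual used by the authors, and the starting guesses near 1 are likewise an assumption.

```python
def secant(residual, g0, g1, tol=1e-12, max_iter=50):
    """Find a root of a scalar function with the secant method."""
    r0, r1 = residual(g0), residual(g1)
    for _ in range(max_iter):
        # Stop on convergence, or if the secant slope would be zero.
        if abs(r1) < tol or r1 == r0:
            break
        # Secant update: intersect the line through the last two points with zero.
        g0, g1 = g1, g1 - r1 * (g1 - g0) / (r1 - r0)
        r0, r1 = r1, residual(g1)
    return g1


def toy_residual(gamma):
    """Hypothetical residual with roots at gamma = 0 and gamma = 0.98."""
    return 0.5 * gamma * (gamma - 0.98)


# Start close to 1 so the iteration picks the nontrivial root, not gamma = 0.
gamma = secant(toy_residual, 1.0, 0.99)
print(gamma)
```

Each iteration costs one residual evaluation and needs no derivative, which is one reason a secant-type method is a natural candidate when the residual itself is expensive to evaluate.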

Citations: 2

Very fast finite element Poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-05-06 DOI: 10.1177/10943420221084657

D. Ruda, S. Turek, D. Ribbrock, P. Zajác

Abstract: Recently, accelerator hardware in the form of graphics cards including Tensor Cores, specialized for AI, has significantly gained importance in the domain of high-performance computing. For example, NVIDIA’s Tesla V100 promises a computing power of up to 125 TFLOP/s achieved by Tensor Cores, but only if half precision floating point format is used. Using numerical examples, we describe the difficulties and the discrepancy between theoretical and actual computing power if one seeks to use such hardware for numerical simulations, that is, solving partial differential equations with a matrix-based finite element method. If certain requirements, namely low condition numbers and many dense matrix operations, are met, the indicated high performance can be reached without an excessive loss of accuracy. A new method to solve linear systems arising from Poisson’s equation in 2D that meets these requirements, based on “prehandling” by means of hierarchical finite elements and an additional Schur complement approach, is presented and analyzed. We provide numerical results illustrating the computational performance of this method and compare it to a commonly used (geometric) multigrid solver on standard hardware. It turns out that we can exploit nearly the full computational power of Tensor Cores and achieve a significant speed-up compared to the standard methodology without losing accuracy.

Volume 36, pages 459–474.

Citations: 4

Performance portability in a real world application: PHAST applied to Caffe

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-05-01 DOI: 10.1177/10943420221077107

Pablo Antonio Martínez, Biagio Peccerillo, S. Bartolini, J. M. García, G. Bernabé

Abstract: This work covers the application of the PHAST Library, a hardware-agnostic programming library, to a real-world application, the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another one to run on the GPU side. With PHAST, we aim to develop a single-source code implementation capable of running efficiently on CPU and GPU. In this paper, we start by carrying out a performance analysis of a basic Caffe implementation using PHAST. Then, we detail possible performance upgrades. We find that the overall performance is dominated by a few ‘heavy’ layers. In refining the inefficient parts of this version, we find two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translate into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes a time equivalent to that of native code on CPU and GPU. Furthermore, with the CIFAR-10 database, PHAST achieves speedups of 51% and 49% against native code on CPU and GPU, respectively. These results provide a new horizon for software development in the upcoming heterogeneous computing era.

Volume 36, pages 419–439.

Citations: 1

Performance portable ice-sheet modeling with MALI

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-04-08 DOI: 10.1177/10943420231183688

Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price

Abstract: High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis not only highlights the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI, with speedups between 1.26 and 1.82x across CPU and GPU architectures, but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.

Volume 37, pages 600–625.

Citations: 2