International Journal of High Performance Computing Applications: Latest Articles

Compressed basis GMRES on high-performance graphics processing units
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-08-05 DOI: 10.1177/10943420221115140
J. Aliaga, H. Anzt, Thomas Grützmacher, E. S. Quintana‐Ortí, A. Tomás
{"title":"Compressed basis GMRES on high-performance graphics processing units","authors":"J. Aliaga, H. Anzt, Thomas Grützmacher, E. S. Quintana‐Ortí, A. Tomás","doi":"10.1177/10943420221115140","DOIUrl":"https://doi.org/10.1177/10943420221115140","url":null,"abstract":"Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in current computer architectures, motivating the investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This article leverages Ginkgo’s memory accessor in order to integrate a communication-reduction strategy into the (Krylov) GMRES solver that decouples the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory accesses, the cost of the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a decrease in the volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating-point as well as fixed-point formats with mild impact on the convergence of the iterative process. We develop a high-performance implementation of the “compressed basis GMRES” solver in the Ginkgo sparse linear algebra library using a large set of test problems from the SuiteSparse Matrix Collection. 
We demonstrate robustness and performance advantages on a modern NVIDIA V100 graphics processing unit (GPU) of up to 50% over the standard GMRES solver that stores all data in IEEE double-precision.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"82 - 100"},"PeriodicalIF":3.1,"publicationDate":"2022-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42489689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Corrigendum to ‘Unprecedented cloud resolution in a GPU-enabled full-physics atmospheric climate simulation on OLCF’s summit supercomputer’
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-07-01 DOI: 10.1177/10943420221103014
M. Norman
{"title":"Corrigendum to ‘Unprecedented cloud resolution in a GPU-enabled full-physics atmospheric climate simulation on OLCF’s summit supercomputer’","authors":"M. Norman","doi":"10.1177/10943420221103014","DOIUrl":"https://doi.org/10.1177/10943420221103014","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"564 - 564"},"PeriodicalIF":3.1,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47297709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Large-Scale direct numerical simulations of turbulence using GPUs and modern Fortran
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-06-23 DOI: 10.1177/10943420231158616
Martin Karp, D. Massaro, Niclas Jansson, A. Hart, Jacob Wahlgren, P. Schlatter, S. Markidis
{"title":"Large-Scale direct numerical simulations of turbulence using GPUs and modern Fortran","authors":"Martin Karp, D. Massaro, Niclas Jansson, A. Hart, Jacob Wahlgren, P. Schlatter, S. Markidis","doi":"10.1177/10943420231158616","DOIUrl":"https://doi.org/10.1177/10943420231158616","url":null,"abstract":"We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world’s first direct numerical simulation of the flow around a Flettner rotor at Re = 30,000 and its interaction with a turbulent boundary layer. We present a performance comparison between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency based on readings from on-chip energy sensors.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"487 - 502"},"PeriodicalIF":3.1,"publicationDate":"2022-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42194182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Accelerating physics simulations with tensor processing units: An inundation modeling example
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-06-03 DOI: 10.1177/10943420221102873
R. Hu, D. Pierce, Yusef Shafi, Anudhyan Boral, V. Anisimov, Sella Nevo, Yi-Fan Chen
{"title":"Accelerating physics simulations with tensor processing units: An inundation modeling example","authors":"R. Hu, D. Pierce, Yusef Shafi, Anudhyan Boral, V. Anisimov, Sella Nevo, Yi-Fan Chen","doi":"10.1177/10943420221102873","DOIUrl":"https://doi.org/10.1177/10943420221102873","url":null,"abstract":"Recent advancements in hardware accelerators such as Tensor Processing Units (TPUs) speed up computation time relative to Central Processing Units (CPUs) not only for machine learning but, as demonstrated here, also for scientific modeling and computer simulations. To study TPU hardware for distributed scientific computing, we solve partial differential equations (PDEs) for the physics simulation of fluids to model riverine floods. We demonstrate that TPUs achieve a two orders of magnitude speedup over CPUs. Running physics simulations on TPUs is publicly accessible via the Google Cloud Platform, and we release a Python interactive notebook version of the simulation.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"510 - 523"},"PeriodicalIF":3.1,"publicationDate":"2022-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41458120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-05-24 DOI: 10.1177/10943420231177631
Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, T. Kühne, Christian Plessl
{"title":"Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics","authors":"Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, T. Kühne, Christian Plessl","doi":"10.1177/10943420231177631","DOIUrl":"https://doi.org/10.1177/10943420231177631","url":null,"abstract":"The non-orthogonal local submatrix method applied to electronic structure–based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32-mixed floating-point arithmetic when using 4400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"530 - 538"},"PeriodicalIF":3.1,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46337516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-05-18 DOI: 10.1177/10943420221107880
M. Kronbichler, D. Sashko, Peter Munch
{"title":"Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations","authors":"M. Kronbichler, D. Sashko, Peter Munch","doi":"10.1177/10943420221107880","DOIUrl":"https://doi.org/10.1177/10943420221107880","url":null,"abstract":"This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products in a CG iteration with the matrix-vector product with only minor organizational overhead. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow RAM memory exactly once per iteration, with all additional access hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that the CG method with the proposed performance optimizations is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance near the strong scaling limit.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"61 - 81"},"PeriodicalIF":3.1,"publicationDate":"2022-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44370332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Performance analysis of relaxation Runge–Kutta methods
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-05-12 DOI: 10.1177/10943420221085947
M. Rogowski, Lisandro Dalcin, M. Parsani, D. Keyes
{"title":"Performance analysis of relaxation Runge–Kutta methods","authors":"M. Rogowski, Lisandro Dalcin, M. Parsani, D. Keyes","doi":"10.1177/10943420221085947","DOIUrl":"https://doi.org/10.1177/10943420221085947","url":null,"abstract":"Recently, global and local relaxation Runge–Kutta methods have been developed for guaranteeing the conservation, dissipation, or other solution properties for general convex functionals whose dynamics are crucial for an ordinary differential equation solution. These novel time integration procedures have an application in a wide range of problems that require dynamics-consistent and stable numerical methods. The application of a relaxation scheme involves solving scalar nonlinear algebraic equations to find the relaxation parameter. Even though root-finding may seem to be a problem technically straightforward and computationally insignificant, we address the problem at scale as we solve full-scale industrial problems on a CPU-powered supercomputer and show its cost to be considerable. In particular, we apply the relaxation schemes in the context of the compressible Navier–Stokes equations and use them to enforce the correct entropy evolution. We use seven different algorithms to solve for the global and local relaxation parameters and analyze their strong scalability. As a result of this analysis, within the global relaxation scheme, we recommend using Brent’s method for problems with a low polynomial degree and of small sizes for the global relaxation scheme, while secant proves to be the best choice for higher polynomial degree solutions and large problem sizes. For the local relaxation scheme, we recommend secant. Further, we compare the schemes’ performance using their most efficient implementations, where we look at their effect on the timestep size, overhead, and weak scalability. We show the global relaxation scheme to be always more expensive than the local approach—typically 1.1–1.5 times the cost. 
At the same time, we highlight scenarios where the global relaxation scheme might underperform due to its increased communication requirements. Finally, we present an analysis that sets expectations on the computational overhead anticipated based on the system properties.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"524 - 542"},"PeriodicalIF":3.1,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44137336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Very fast finite element Poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-05-06 DOI: 10.1177/10943420221084657
D. Ruda, S. Turek, D. Ribbrock, P. Zajác
{"title":"Very fast finite element Poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100","authors":"D. Ruda, S. Turek, D. Ribbrock, P. Zajác","doi":"10.1177/10943420221084657","DOIUrl":"https://doi.org/10.1177/10943420221084657","url":null,"abstract":"Recently, accelerator hardware in the form of graphics cards including Tensor Cores, specialized for AI, has significantly gained importance in the domain of high-performance computing. For example, NVIDIA’s Tesla V100 promises a computing power of up to 125 TFLOP/s achieved by Tensor Cores, but only if half precision floating point format is used. We describe the difficulties and discrepancy between theoretical and actual computing power if one seeks to use such hardware for numerical simulations, that is, solving partial differential equations with a matrix-based finite element method, with numerical examples. If certain requirements, namely low condition numbers and many dense matrix operations, are met, the indicated high performance can be reached without an excessive loss of accuracy. A new method to solve linear systems arising from Poisson’s equation in 2D that meets these requirements, based on “prehandling” by means of hierarchical finite elements and an additional Schur complement approach, is presented and analyzed. We provide numerical results illustrating the computational performance of this method and compare it to a commonly used (geometric) multigrid solver on standard hardware. 
It turns out that we can exploit nearly the full computational power of Tensor Cores and achieve a significant speed-up compared to the standard methodology without losing accuracy.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"459 - 474"},"PeriodicalIF":3.1,"publicationDate":"2022-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43024478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Performance portability in a real world application: PHAST applied to Caffe
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-05-01 DOI: 10.1177/10943420221077107
Pablo Antonio Martínez, Biagio Peccerillo, S. Bartolini, J. M. García, G. Bernabé
{"title":"Performance portability in a real world application: PHAST applied to Caffe","authors":"Pablo Antonio Martínez, Biagio Peccerillo, S. Bartolini, J. M. García, G. Bernabé","doi":"10.1177/10943420221077107","DOIUrl":"https://doi.org/10.1177/10943420221077107","url":null,"abstract":"This work covers the PHAST Library’s employment, a hardware-agnostic programming library, to a real-world application like the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another one to run on the GPU side. With PHAST, we aim to develop a single-source code implementation capable of running efficiently on CPU and GPU. In this paper, we start by carrying out a basic Caffe implementation performance analysis using PHAST. Then, we detail possible performance upgrades. We find that the overall performance is dominated by few ‘heavy’ layers. In refining the inefficient parts of this version, we find two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translates into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes an equivalent amount of time as native code in CPU and GPU. Furthermore, PHAST achieves a speedup of 51% and a 49% with the CIFAR-10 database against native code in CPU and GPU, respectively. 
These results provide a new horizon for software development in the upcoming heterogeneous computing era.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"419 - 439"},"PeriodicalIF":3.1,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49259162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Performance portable ice-sheet modeling with MALI
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-04-08 DOI: 10.1177/10943420231183688
Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price
{"title":"Performance portable ice-sheet modeling with MALI","authors":"Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price","doi":"10.1177/10943420231183688","DOIUrl":"https://doi.org/10.1177/10943420231183688","url":null,"abstract":"High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights not only the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI with speedups between 1.26 and 1.82x across CPU and GPU architectures but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. 
We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"600 - 625"},"PeriodicalIF":3.1,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41395091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2