International Journal of High Performance Computing Applications: Latest Articles

Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2023-01-03 DOI: 10.1177/10943420221136848
Florent Lopez, Théo Mary
Abstract: Modern GPUs equipped with mixed precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2× and 3.5× speedups on V100 and A100 GPUs, respectively.
Volume/issue: 37 1, Pages: 165 - 179
Citations: 9
High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2023-01-01 DOI: 10.1007/978-3-031-32041-5
Volume/issue: 14 1
Citations: 0
Special issue: Introduction
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2023-01-01 DOI: 10.1177/10943420221150081
M. Parsons
Abstract: The COVID pandemic has changed all of our lives and continues to do so. The prizes recognise outstanding research achievement toward the understanding of the COVID-19 pandemic through the use of high-performance computing. The winning paper, entitled 'Digital transformation of droplet/aerosol infection risk assessment realised on "Fugaku" for the fight against COVID-19', was submitted by a team from the RIKEN Center for Computational Science in Japan. [Extracted from the article]
Volume/issue: 37 1, Pages: 3 - 3
Citations: 0
Performance comparison of the A-grid and C-grid shallow-water models on icosahedral grids
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-11-15 DOI: 10.1177/10943420221139509
J. Middlecoff, Yonggang G. Yu, M. Govett
Abstract: This study uses a single software framework to compare the CPU performance of Arakawa A-grid (NICAM) and C-grid (MPAS) schemes for solving the shallow-water equations on icosahedral grids. The focus is on high-resolution weather prediction. Performance analysis shows the simpler structure of the A-grid equations enables compiler optimization-based efficiency gains that the C-grid equations cannot match. Strong scaling runs at 3.5 km resolution show the A-grid is three times faster than the C-grid, enabling the A-grid to run at 50% higher resolution in only 15% more time. A performance comparison with the MPAS shallow-water model is included which demonstrates that our software implementation of the C-grid is robust and comparisons are fair.
Volume/issue: 37 1, Pages: 197 - 208
Citations: 0
Acceleration of a parallel BDDC solver by using graphics processing units on subdomains
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-11-05 DOI: 10.1177/10943420221136873
J. Šístek, T. Oberhuber
Abstract: An approach to accelerating a parallel domain decomposition (DD) solver by graphics processing units (GPUs) is investigated. The solver is based on the Balancing Domain Decomposition Method by Constraints (BDDC), which is a nonoverlapping DD technique. Two kinds of local matrices are required by BDDC. First, dense matrices corresponding to local Schur complements of interior unknowns are constructed by the sparse direct solver. These are further used as part of the local saddle-point problems within BDDC. In the next step, the local matrices are copied to GPUs. Repeated multiplications of local vectors with the dense matrix of the Schur complement are performed for each subdomain. In addition, factorizations and backsubstitutions with the dense saddle-point subdomain matrices are also performed on GPUs. Detailed times of main components of the algorithm are measured on a benchmark Poisson problem. The method is also applied to an unsteady problem of incompressible flow, where the Krylov subspace iterations are performed repeatedly in each time step. The results demonstrate the potential of the approach to speed up realistic simulations up to 5 times with a preference towards large subdomains.
Volume/issue: 37 1, Pages: 151 - 164
Citations: 1
End-to-end GPU acceleration of low-order-refined preconditioning for high-order finite element discretizations
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-10-21 DOI: 10.1177/10943420231175462
Will Pazner, T. Kolev, Jean-Sylvain Camier
Abstract: In this article, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in H (curl) and H (div) (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation–histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The different relative weighting and significance of the algorithmic components on GPUs and CPUs is discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.
Volume/issue: 37 1, Pages: 578 - 599
Citations: 1
Exploiting temporal data reuse and asynchrony in the reverse time migration
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-10-03 DOI: 10.1177/10943420221128529
L. Qu, Rached Abdelkhalak, H. Ltaief, Issam Said, D. Keyes
Abstract: Reverse Time Migration (RTM) is a state-of-the-art algorithm used in seismic depth imaging in complex geological environments for the oil and gas exploration industry. It calculates high-resolution images by solving the three-dimensional acoustic wave equation using seismic datasets recorded at various receiver locations. Reverse Time Migration’s computational phases are predominantly composed of stencil computational kernels for the finite-difference time-domain scheme, applying the absorbing boundary conditions, and I/O operations needed for the imaging condition. In this paper, we integrate the asynchronous Multicore Wavefront Diamond (MWD) tiling approach into the full RTM workflow. Multicore Wavefront Diamond permits to further increase data reuse by leveraging spatial with Temporal Blocking (TB) during the stencil computations. This integration engenders new challenges with a snowball effect on the legacy synchronous RTM workflow as it requires rethinking of how the absorbing boundary conditions, the I/O operations, and the imaging condition operate. These disruptive changes are necessary to maintain the performance superiority of asynchronous stencil execution throughout the time integration, while ensuring the quality of the subsurface image does not deteriorate. We assess the overall performance of the new MWD-based RTM and compare against traditional Spatial Blocking (SB)-based RTM on various shared-memory systems using the SEG Salt3D model. The MWD-based RTM achieves up to 70% performance speedup compared to SB-based RTM. To our knowledge, this paper highlights for the first time the applicability of asynchronous executions with temporal blocking throughout the whole RTM. This may eventually create new research opportunities in improving hydrocarbon extraction for the petroleum industry.
Volume/issue: 37 1, Pages: 132 - 150
Citations: 2
PeleC: An adaptive mesh refinement solver for compressible reacting flows
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-09-06 DOI: 10.1177/10943420221121151
M. T. Henry de Frahan, Jonathan S. Rood, M. Day, H. Sitaraman, S. Yellapantula, Bruce A. Perry, R. Grout, A. Almgren, Weiqun Zhang, J. Bell, Jacqueline H. Chen
Abstract: Reacting flow simulations for combustion applications require extensive computing capabilities. Leveraging the AMReX library, the Pele suite of combustion simulation tools targets the largest supercomputers available and future exascale machines. We introduce PeleC, the compressible solver in the Pele suite, and detail its capabilities, including complex geometry representation, chemistry integration, and discretization. We present a comparison of development efforts using both OpenACC and AMReX’s C++ performance portability framework for execution on multiple GPU architectures. We discuss relevant details that have allowed PeleC to achieve high performance and scalability. PeleC’s performance characteristics are measured through relevant simulations on multiple supercomputers. The success of PeleC’s design for exascale is exhibited through demonstration of a 160 billion cell simulation and weak scaling onto 100% of Summit, an NVIDIA-based GPU supercomputer at Oak Ridge National Laboratory. Our results provide confidence that PeleC will enable future combustion science simulations with unprecedented fidelity.
Volume/issue: 37 1, Pages: 115 - 131
Citations: 16
Enabling efficient execution of a variational data assimilation application
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-08-28 DOI: 10.1177/10943420221119801
J. Dennis, A. Baker, B. Dobbins, M. Bell, Jian Sun, Youngsung Kim, Ting-Yu Cha
Abstract: Remote sensing observational instruments are critical for better understanding and predicting severe weather. Observational data from such instruments, such as Doppler radar data, for example, are often processed for assimilation into numerical weather prediction models. As such instruments become more sophisticated, the amount of data to be processed grows and requires efficient variational analysis tools. Here we examine the code that implements the popular SAMURAI (Spline Analysis at Mesoscale Utilizing Radar and Aircraft Instrumentation) technique for estimating the atmospheric state for a given set of observations. We employ a number of techniques to significantly improve the code’s performance, including porting it to run on standard HPC clusters, analyzing and optimizing its single-node performance, implementing a more efficient nonlinear optimization method, and enabling the use of GPUs via OpenACC. Our efforts thus far have yielded more than 100x improvement over the original code on large test problems of interest to the community.
Volume/issue: 37 1, Pages: 101 - 114
Citations: 0
Compressed basis GMRES on high-performance graphics processing units
IF 3.1, Zone 3, Computer Science
International Journal of High Performance Computing Applications Pub Date : 2022-08-05 DOI: 10.1177/10943420221115140
J. Aliaga, H. Anzt, Thomas Grützmacher, E. S. Quintana‐Ortí, A. Tomás
Abstract: Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in current computer architectures, motivating the investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This article leverages Ginkgo’s memory accessor in order to integrate a communication-reduction strategy into the (Krylov) GMRES solver that decouples the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory accesses, the cost of the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a decrease in the volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating-point as well as fixed-point formats with mild impact on the convergence of the iterative process. We develop a high-performance implementation of the “compressed basis GMRES” solver in the Ginkgo sparse linear algebra library using a large set of test problems from the SuiteSparse Matrix Collection. We demonstrate robustness and performance advantages on a modern NVIDIA V100 graphics processing unit (GPU) of up to 50% over the standard GMRES solver that stores all data in IEEE double-precision.
Volume/issue: 37 1, Pages: 82 - 100
Citations: 1
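Note: the compressed basis GMRES entry above describes decoupling the storage format of a basis from the precision used for arithmetic. The following is only a minimal illustrative sketch of that general idea in plain Python, not the Ginkgo accessor or the authors' implementation. The helper names (`normalized`, `compress`, `dot_compressed`) and the choice of IEEE single precision as the storage format are assumptions made for this example.

```python
import math
import random
from array import array


def normalized(vec):
    """Scale a vector to unit 2-norm, so every entry is bounded by 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def compress(vec):
    """Hypothetical storage step: keep entries in single precision (4 bytes each)."""
    return array("f", vec)


def dot_compressed(stored, other):
    """Read the compressed entries, but multiply and accumulate in double precision."""
    return math.fsum(float(s) * o for s, o in zip(stored, other))


random.seed(0)
n = 1000
v = normalized([random.uniform(-1.0, 1.0) for _ in range(n)])  # unit basis vector
w = [random.uniform(-1.0, 1.0) for _ in range(n)]              # vector kept in double

v32 = compress(v)
exact = math.fsum(a * b for a, b in zip(v, w))   # all-double reference
approx = dot_compressed(v32, w)                  # compressed storage, double arithmetic

bytes_double = array("d", v).itemsize * n
bytes_single = v32.itemsize * n
error = abs(exact - approx)

print(f"storage: {bytes_single} bytes instead of {bytes_double} bytes")
print(f"dot-product error from compressed storage: {error:.3e}")
```

In this sketch the stored vector occupies half the memory of the double-precision copy, while the rounding introduced by storage stays small because the entries of a unit vector are bounded by 1. How such an error affects GMRES convergence is a separate question that this toy example does not address.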