International Journal of High Performance Computing Applications: latest articles, page 4

Myths and legends in high-performance computing

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2023-01-06 DOI: 10.1177/10943420231166608

S. Matsuoka, Jens Domke, M. Wahib, Aleksandr Drozd, T. Hoefler

Abstract: In this thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We gathered these myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within and beyond our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore’s law. While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research. Nevertheless, these myths are rarely based on scientific facts, but rather on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates, such as whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.

Volume: 37 1; pages: 245-259

Citations: 5

Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2023-01-03 DOI: 10.1177/10943420221136848

Florent Lopez, Théo Mary

Abstract: Modern GPUs equipped with mixed precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2× and 3.5× speedups on V100 and A100 GPUs, respectively.

Volume: 37 1; pages: 165-179

Citations: 9
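
The abstract above notes that naively storing the matrix in half precision loses accuracy, and that the authors counter this with single precision buffers. The following is only a toy illustration of that general effect, not the paper's LU algorithm: it accumulates many small updates into one value, once rounding to half precision after every update and once accumulating in a single precision buffer and rounding to half only at the end. The helper names (`to_half`, `to_single`, `accumulate`, `rel_err`) and the update values are assumptions made for this sketch.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(updates, store):
    """Add updates one by one, rounding the running value with `store` each time."""
    acc = 0.0
    for u in updates:
        acc = store(acc + u)
    return acc


def rel_err(value, reference):
    """Relative error of `value` with respect to `reference`."""
    return abs(value - reference) / abs(reference)


# Hypothetical workload: 10000 identical small updates, each representable in half.
updates = [to_half(1e-3)] * 10000
exact = sum(updates)  # double precision reference

# Variant 1: the running value lives in half precision the whole time.
half_stored = accumulate(updates, to_half)

# Variant 2: accumulate in a single precision buffer, store the result in half once.
buffered = to_half(accumulate(updates, to_single))

print("half storage throughout:", rel_err(half_stored, exact))
print("single buffer, half at end:", rel_err(buffered, exact))
```

In the first variant the running value stops growing once the updates become smaller than half the spacing between neighbouring half precision numbers, which is the kind of accuracy loss a higher precision working buffer avoids.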

High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2023-01-01 DOI: 10.1007/978-3-031-32041-5

Citations: 0

Special issue: Introduction

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2023-01-01 DOI: 10.1177/10943420221150081

M. Parsons

Citations: 0

Performance comparison of the A-grid and C-grid shallow-water models on icosahedral grids

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-11-15 DOI: 10.1177/10943420221139509

J. Middlecoff, Yonggang G. Yu, M. Govett

Citations: 0

Acceleration of a parallel BDDC solver by using graphics processing units on subdomains

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-11-05 DOI: 10.1177/10943420221136873

J. Šístek, T. Oberhuber

Abstract: An approach to accelerating a parallel domain decomposition (DD) solver by graphics processing units (GPUs) is investigated. The solver is based on the Balancing Domain Decomposition Method by Constraints (BDDC), which is a nonoverlapping DD technique. Two kinds of local matrices are required by BDDC. First, dense matrices corresponding to local Schur complements of interior unknowns are constructed by the sparse direct solver. These are further used as part of the local saddle-point problems within BDDC. In the next step, the local matrices are copied to GPUs. Repeated multiplications of local vectors with the dense matrix of the Schur complement are performed for each subdomain. In addition, factorizations and backsubstitutions with the dense saddle-point subdomain matrices are also performed on GPUs. Detailed times of main components of the algorithm are measured on a benchmark Poisson problem. The method is also applied to an unsteady problem of incompressible flow, where the Krylov subspace iterations are performed repeatedly in each time step. The results demonstrate the potential of the approach to speed up realistic simulations up to 5 times with a preference towards large subdomains.

Volume: 37 1; pages: 151-164

Citations: 1

End-to-end GPU acceleration of low-order-refined preconditioning for high-order finite element discretizations

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-10-21 DOI: 10.1177/10943420231175462

Will Pazner, T. Kolev, Jean-Sylvain Camier

Abstract: In this article, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in H(curl) and H(div) (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation–histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The different relative weighting and significance of the algorithmic components on GPUs and CPUs is discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.

Volume: 37 1; pages: 578-599

Citations: 1

Exploiting temporal data reuse and asynchrony in the reverse time migration

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-10-03 DOI: 10.1177/10943420221128529

L. Qu, Rached Abdelkhalak, H. Ltaief, Issam Said, D. Keyes

Abstract: Reverse Time Migration (RTM) is a state-of-the-art algorithm used in seismic depth imaging in complex geological environments for the oil and gas exploration industry. It calculates high-resolution images by solving the three-dimensional acoustic wave equation using seismic datasets recorded at various receiver locations. Reverse Time Migration’s computational phases are predominantly composed of stencil computational kernels for the finite-difference time-domain scheme, applying the absorbing boundary conditions, and I/O operations needed for the imaging condition. In this paper, we integrate the asynchronous Multicore Wavefront Diamond (MWD) tiling approach into the full RTM workflow. Multicore Wavefront Diamond further increases data reuse by combining spatial blocking with Temporal Blocking (TB) during the stencil computations. This integration engenders new challenges with a snowball effect on the legacy synchronous RTM workflow, as it requires rethinking how the absorbing boundary conditions, the I/O operations, and the imaging condition operate. These disruptive changes are necessary to maintain the performance superiority of asynchronous stencil execution throughout the time integration, while ensuring the quality of the subsurface image does not deteriorate. We assess the overall performance of the new MWD-based RTM and compare against traditional Spatial Blocking (SB)-based RTM on various shared-memory systems using the SEG Salt3D model. The MWD-based RTM achieves up to 70% performance speedup compared to SB-based RTM. To our knowledge, this paper highlights for the first time the applicability of asynchronous executions with temporal blocking throughout the whole RTM. This may eventually create new research opportunities in improving hydrocarbon extraction for the petroleum industry.

Volume: 37 1; pages: 132-150

Citations: 2
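
To make the idea of temporal blocking in the abstract above more concrete, here is a minimal sketch on a 1D three-point averaging stencil. It uses overlapped tiles with a halo as wide as the number of time steps, which is a simpler scheme than the Multicore Wavefront Diamond tiling the paper describes and is swapped in here purely for illustration. The stencil, the function names (`step`, `naive`, `time_blocked`), and the tile size are all assumptions of this sketch.

```python
def step(u):
    """One sweep of a 3-point averaging stencil; the two end points stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def naive(u, steps):
    """Reference version: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def time_blocked(u, steps, tile):
    """Advance `steps` time steps tile by tile, reusing each tile's data in time.

    Each tile is extended by a halo of `steps` cells per side (clipped at the
    domain boundary), advanced locally, and only its interior is written back.
    Cells near an artificial tile edge become stale by one cell per step, so a
    halo of width `steps` keeps the tile interior correct.
    """
    n = len(u)
    out = u[:]
    for a in range(0, n, tile):
        b = min(a + tile, n)
        lo, hi = max(a - steps, 0), min(b + steps, n)
        local = naive(u[lo:hi], steps)  # all time steps on a small working set
        out[a:b] = local[a - lo:b - lo]
    return out


# Hypothetical input data for the check below.
u0 = [float((i * 7) % 11) for i in range(64)]
blocked = time_blocked(u0, steps=4, tile=16)
reference = naive(u0, 4)
print(max(abs(x - y) for x, y in zip(blocked, reference)))
```

The overlapped scheme trades redundant halo computation for locality; diamond and wavefront tilings avoid that redundancy at the cost of a more intricate schedule.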

PeleC: An adaptive mesh refinement solver for compressible reacting flows

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-09-06 DOI: 10.1177/10943420221121151

M. T. Henry de Frahan, Jonathan S. Rood, M. Day, H. Sitaraman, S. Yellapantula, Bruce A. Perry, R. Grout, A. Almgren, Weiqun Zhang, J. Bell, Jacqueline H. Chen

Abstract: Reacting flow simulations for combustion applications require extensive computing capabilities. Leveraging the AMReX library, the Pele suite of combustion simulation tools targets the largest supercomputers available and future exascale machines. We introduce PeleC, the compressible solver in the Pele suite, and detail its capabilities, including complex geometry representation, chemistry integration, and discretization. We present a comparison of development efforts using both OpenACC and AMReX’s C++ performance portability framework for execution on multiple GPU architectures. We discuss relevant details that have allowed PeleC to achieve high performance and scalability. PeleC’s performance characteristics are measured through relevant simulations on multiple supercomputers. The success of PeleC’s design for exascale is exhibited through demonstration of a 160 billion cell simulation and weak scaling onto 100% of Summit, an NVIDIA-based GPU supercomputer at Oak Ridge National Laboratory. Our results provide confidence that PeleC will enable future combustion science simulations with unprecedented fidelity.

Volume: 37 1; pages: 115-131

Citations: 16

Enabling efficient execution of a variational data assimilation application

IF 3.1, Zone 3 (Computer Science)

International Journal of High Performance Computing Applications Pub Date : 2022-08-28 DOI: 10.1177/10943420221119801

J. Dennis, A. Baker, B. Dobbins, M. Bell, Jian Sun, Youngsung Kim, Ting-Yu Cha

Citations: 0