{"title":"Myths and legends in high-performance computing","authors":"S. Matsuoka, Jens Domke, M. Wahib, Aleksandr Drozd, T. Hoefler","doi":"10.1177/10943420231166608","DOIUrl":"https://doi.org/10.1177/10943420231166608","url":null,"abstract":"In this thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We gathered these myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within and beyond our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore’s law. While some laws end, new directions are emerging, such as algorithmic scaling or novel architecture research. Nevertheless, these myths are rarely based on scientific facts, but rather on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates, such as whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"245 - 259"},"PeriodicalIF":3.1,"publicationDate":"2023-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42989305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint","authors":"Florent Lopez, Théo Mary","doi":"10.1177/10943420221136848","DOIUrl":"https://doi.org/10.1177/10943420221136848","url":null,"abstract":"Modern GPUs equipped with mixed precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2× and 3.5× speedups on V100 and A100 GPUs, respectively.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"165 - 179"},"PeriodicalIF":3.1,"publicationDate":"2023-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42887017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Computing: 38th International Conference, ISC High Performance 2023, Hamburg, Germany, May 21–25, 2023, Proceedings","authors":"","doi":"10.1007/978-3-031-32041-5","DOIUrl":"https://doi.org/10.1007/978-3-031-32041-5","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"14 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88116938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Special issue: Introduction","authors":"M. Parsons","doi":"10.1177/10943420221150081","DOIUrl":"https://doi.org/10.1177/10943420221150081","url":null,"abstract":"The COVID pandemic has changed all of our lives and continues to do so. The prizes recognise outstanding research achievement toward the understanding of the COVID-19 pandemic through the use of high-performance computing. The winning paper, entitled 'Digital transformation of droplet/aerosol infection risk assessment realised on \"Fugaku\" for the fight against COVID-19', was submitted by a team from the RIKEN Center for Computational Science in Japan. [Extracted from the article]","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"3 - 3"},"PeriodicalIF":3.1,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45540807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance comparison of the A-grid and C-grid shallow-water models on icosahedral grids","authors":"J. Middlecoff, Yonggang G. Yu, M. Govett","doi":"10.1177/10943420221139509","DOIUrl":"https://doi.org/10.1177/10943420221139509","url":null,"abstract":"This study uses a single software framework to compare the CPU performance of Arakawa A-grid (NICAM) and C-grid (MPAS) schemes for solving the shallow-water equations on icosahedral grids. The focus is on high-resolution weather prediction. Performance analysis shows the simpler structure of the A-grid equations enables compiler optimization-based efficiency gains that the C-grid equations cannot match. Strong scaling runs at 3.5 km resolution show the A-grid is three times faster than the C-grid, enabling the A-grid to run at 50% higher resolution in only 15% more time. A performance comparison with the MPAS shallow-water model is included which demonstrates that our software implementation of the C-grid is robust and comparisons are fair.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"197 - 208"},"PeriodicalIF":3.1,"publicationDate":"2022-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44206468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of a parallel BDDC solver by using graphics processing units on subdomains","authors":"J. Šístek, T. Oberhuber","doi":"10.1177/10943420221136873","DOIUrl":"https://doi.org/10.1177/10943420221136873","url":null,"abstract":"An approach to accelerating a parallel domain decomposition (DD) solver by graphics processing units (GPUs) is investigated. The solver is based on the Balancing Domain Decomposition Method by Constraints (BDDC), which is a nonoverlapping DD technique. Two kinds of local matrices are required by BDDC. First, dense matrices corresponding to local Schur complements of interior unknowns are constructed by the sparse direct solver. These are further used as part of the local saddle-point problems within BDDC. In the next step, the local matrices are copied to GPUs. Repeated multiplications of local vectors with the dense matrix of the Schur complement are performed for each subdomain. In addition, factorizations and backsubstitutions with the dense saddle-point subdomain matrices are also performed on GPUs. Detailed times of main components of the algorithm are measured on a benchmark Poisson problem. The method is also applied to an unsteady problem of incompressible flow, where the Krylov subspace iterations are performed repeatedly in each time step. The results demonstrate the potential of the approach to speed up realistic simulations up to 5 times with a preference towards large subdomains.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"151 - 164"},"PeriodicalIF":3.1,"publicationDate":"2022-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42735563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end GPU acceleration of low-order-refined preconditioning for high-order finite element discretizations","authors":"Will Pazner, T. Kolev, Jean-Sylvain Camier","doi":"10.1177/10943420231175462","DOIUrl":"https://doi.org/10.1177/10943420231175462","url":null,"abstract":"In this article, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in H(curl) and H(div) (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation–histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The different relative weighting and significance of the algorithmic components on GPUs and CPUs is discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"578 - 599"},"PeriodicalIF":3.1,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45709588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting temporal data reuse and asynchrony in the reverse time migration","authors":"L. Qu, Rached Abdelkhalak, H. Ltaief, Issam Said, D. Keyes","doi":"10.1177/10943420221128529","DOIUrl":"https://doi.org/10.1177/10943420221128529","url":null,"abstract":"Reverse Time Migration (RTM) is a state-of-the-art algorithm used in seismic depth imaging in complex geological environments for the oil and gas exploration industry. It calculates high-resolution images by solving the three-dimensional acoustic wave equation using seismic datasets recorded at various receiver locations. Reverse Time Migration’s computational phases are predominantly composed of stencil computational kernels for the finite-difference time-domain scheme, applying the absorbing boundary conditions, and I/O operations needed for the imaging condition. In this paper, we integrate the asynchronous Multicore Wavefront Diamond (MWD) tiling approach into the full RTM workflow. Multicore Wavefront Diamond permits to further increase data reuse by leveraging spatial with Temporal Blocking (TB) during the stencil computations. This integration engenders new challenges with a snowball effect on the legacy synchronous RTM workflow as it requires rethinking of how the absorbing boundary conditions, the I/O operations, and the imaging condition operate. These disruptive changes are necessary to maintain the performance superiority of asynchronous stencil execution throughout the time integration, while ensuring the quality of the subsurface image does not deteriorate. We assess the overall performance of the new MWD-based RTM and compare against traditional Spatial Blocking (SB)-based RTM on various shared-memory systems using the SEG Salt3D model. The MWD-based RTM achieves up to 70% performance speedup compared to SB-based RTM. To our knowledge, this paper highlights for the first time the applicability of asynchronous executions with temporal blocking throughout the whole RTM. This may eventually create new research opportunities in improving hydrocarbon extraction for the petroleum industry.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"132 - 150"},"PeriodicalIF":3.1,"publicationDate":"2022-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43710430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PeleC: An adaptive mesh refinement solver for compressible reacting flows","authors":"M. T. Henry de Frahan, Jonathan S. Rood, M. Day, H. Sitaraman, S. Yellapantula, Bruce A. Perry, R. Grout, A. Almgren, Weiqun Zhang, J. Bell, Jacqueline H. Chen","doi":"10.1177/10943420221121151","DOIUrl":"https://doi.org/10.1177/10943420221121151","url":null,"abstract":"Reacting flow simulations for combustion applications require extensive computing capabilities. Leveraging the AMReX library, the Pele suite of combustion simulation tools targets the largest supercomputers available and future exascale machines. We introduce PeleC, the compressible solver in the Pele suite, and detail its capabilities, including complex geometry representation, chemistry integration, and discretization. We present a comparison of development efforts using both OpenACC and AMReX’s C++ performance portability framework for execution on multiple GPU architectures. We discuss relevant details that have allowed PeleC to achieve high performance and scalability. PeleC’s performance characteristics are measured through relevant simulations on multiple supercomputers. The success of PeleC’s design for exascale is exhibited through demonstration of a 160 billion cell simulation and weak scaling onto 100% of Summit, an NVIDIA-based GPU supercomputer at Oak Ridge National Laboratory. Our results provide confidence that PeleC will enable future combustion science simulations with unprecedented fidelity.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"115 - 131"},"PeriodicalIF":3.1,"publicationDate":"2022-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45540928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling efficient execution of a variational data assimilation application","authors":"J. Dennis, A. Baker, B. Dobbins, M. Bell, Jian Sun, Youngsung Kim, Ting-Yu Cha","doi":"10.1177/10943420221119801","DOIUrl":"https://doi.org/10.1177/10943420221119801","url":null,"abstract":"Remote sensing observational instruments are critical for better understanding and predicting severe weather. Observational data from such instruments, such as Doppler radar data, for example, are often processed for assimilation into numerical weather prediction models. As such instruments become more sophisticated, the amount of data to be processed grows and requires efficient variational analysis tools. Here we examine the code that implements the popular SAMURAI (Spline Analysis at Mesoscale Utilizing Radar and Aircraft Instrumentation) technique for estimating the atmospheric state for a given set of observations. We employ a number of techniques to significantly improve the code’s performance, including porting it to run on standard HPC clusters, analyzing and optimizing its single-node performance, implementing a more efficient nonlinear optimization method, and enabling the use of GPUs via OpenACC. Our efforts thus far have yielded more than 100x improvement over the original code on large test problems of interest to the community.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"101 - 114"},"PeriodicalIF":3.1,"publicationDate":"2022-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45282216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}